Preface 



The 9th International Conference on Extending Database Technology, EDBT 
2004, was held in Heraklion, Crete, Greece, during March 14-18, 2004. The 
EDBT series of conferences is an established and prestigious forum for the 
exchange of the latest research results in data management. Held every two 
years in an attractive European location, the conference provides unique 
opportunities for database researchers, practitioners, developers, and users to explore 
new ideas, techniques, and tools, and to exchange experiences. The previous 
events were held in Venice, Vienna, Cambridge, Avignon, Valencia, Konstanz, 
and Prague. 

EDBT 2004 had the theme “new challenges for database technology,” with 
the goal of encouraging researchers to take a greater interest in the current 
exciting technological and application advancements and to devise and address 
new research and development directions for database technology. From its early 
days, database technology has been challenged and advanced by new uses and 
applications, and it continues to evolve along with application requirements and 
hardware advances. Today’s DBMS technology faces several new challenges. 
Technological trends and new computation paradigms, and applications such 
as pervasive and ubiquitous computing, grid computing, bioinformatics, trust 
management, virtual communities, and digital asset management, to name just 
a few, require database technology to be deployed in a variety of environments 
and for a number of different purposes. Such an extensive deployment will also 
require trustworthy, resilient database systems, as well as easy-to-manage and 
flexible ones, to which we can entrust our data in whatever form they are. 

The call for papers attracted a very large number of submissions, including 
294 research papers and 22 software demo proposals. The program committee 
selected 42 research papers, 2 industrial and application papers, and 15 software 
demos. The program was complemented by three keynote speeches, by Rick Hull, 
Keith Jeffery, and Bhavani Thuraisingham, and two panels. 

This volume collects all papers and software demos presented at the conference, 
in addition to an invited paper. The research papers cover a broad variety 
of topics, ranging from well-established topics like data mining and indexing 
techniques to more innovative topics such as peer-to-peer systems and 
trustworthy systems. We hope that these proceedings will serve as a valuable reference 
for data management researchers and developers. 

Many people contributed to EDBT 2004. Clearly, foremost thanks go to the 
authors of all submitted papers. The increased number of submissions, compared 
to previous years, showed that the database area is now a key 
technological area with many exciting research directions. We are grateful for 
the dedication and hard work of all program committee members who made the 
review process both thorough and effective. We also thank the external referees 
for their important contribution to the review process. 

In addition to those who contributed to the review process, there are many 
others who helped to make the conference a success. Special thanks go to Lida 
Harami for maintaining the EDBT 2004 conference Web site, to Christiana 
Daskalaki for helping with the proceedings material, and to Triaena Tours and 
Congress for the logistics and organizational support. The financial and in-kind 
support by the conference sponsors is gratefully acknowledged. 

December 2003 Elisa Bertino, Stavros Christodoulakis 

Dimitris Plexousakis 
Vassilis Christophides, Manolis Koubarakis 
Klemens Böhm, Elena Ferrari 
Converged Services: A Hidden Challenge for the 
Web Services Paradigm 



Richard Hull 

Bell Labs Research, Lucent Technologies, Murray Hill, NJ 07974 

The web has brought a revolution in sharing information and in human-computer 
interaction. The web services paradigm (based initially on standards such as 
SOAP, WSDL, UDDI, BPEL) will bring the next revolution, enabling flexible, 
intricate, and largely automated interactions between web-resident services and 
applications. But the telecommunications world is also changing, from isolated, 
monolithic legacy stove-pipes, to a much more modular, internet-style framework 
that will enable rich flexibility in creating communication and collaboration 
services. This will be enabled by the existing Parlay/OSA standard and emerging 
standards for all-IP networks (e.g., 3GPP IMS). We are evolving towards a 
world of “converged” services, not two parallel worlds of web services vs. 
telecom services. 

Converged services will arise in a variety of contexts, e.g., e-commerce and 
mobile commerce, collaboration systems, interactive games, education, and 
entertainment. This talk begins by discussing standards for the web and telecom, 
identifying key aspects that may need to evolve as the two networks converge. 
We then highlight research challenges created by the emergence of converged 
services along three dimensions: (1) profile data management, (2) preferences 
management, and (3) services composition. For (1) we describe a proposal from 
the wireless telecom community for giving services the end-user profile data they 
need, while respecting end-user concerns regarding privacy and data sharing [SHLX03]. 
For (2) we describe an approach to supporting high-speed preferences 
management, whereby service providers can inexpensively cater to the needs of a broad 
variety of applications and categories of end-users [HKL+03b, HKL+04]. We also 
discuss the issue of “federated policy management”, which arises because 
policies around end-user preferences will be distributed across multiple applications 
and network components [HKL03a]. For (3) we discuss an emerging technology 
for composing web services based on behavioral signatures [BFHS03,HBCS03] 
and a key contrast between web services and telecom services [CKH+01]. 

References 
Abstract. GRID technology, emerging in the late nineties, has evolved from a 
metacomputing architecture towards a pervasive computation and information 
utility. However, the architectural developments strongly echo the 
computational origins, and information systems engineering aspects have received scant 
attention. The development within the GRID community of the W3C-inspired 
OGSA indicates a willingness to move in a direction more suited to the wider 
end-user requirements. In particular, the OGSA/DAI initiative provides a 
web-services-level interface to databases. In contrast to this stream of development, 
early architectural ideas for a more general GRIDs environment articulated in 
the UK in 1999 have recently been more widely accepted, modified, evolved and 
enhanced by a group of experts working under the auspices of the new EC 
DGINFSO F2 (GRIDs) Unit. The resulting report on ‘Next Generation GRIDs’ 
was published in June 2003 and is released by the EC as an adjunct to the FP6 
Call for Proposals Documentation. The report proposes the need for a wealth of 
research in all aspects of information systems engineering, within which the 
topics of advanced distributed parallel multimedia heterogeneous database 
systems with greater representativity and expressivity have some prominence. 
Topics such as metadata, security, trust, persistence, performance, scalability 
are all included. This represents a huge opportunity for the database 
community, particularly in Europe. 



1 Introduction 

The concept of the GRID was initiated in the USA in the late 1990s. Its prime purpose 
was to couple supercomputers in order to provide greater computational power and to 
utilise otherwise wasted central processor cycles. Starting with computer-specialised 
closed systems that could not interoperate, the second generation consists essentially 
of middleware which schedules a computational task as batch jobs across multiple 
computers. However, the end-user interface is procedural rather than fully declarative 
and the aspects of resource discovery, data interfacing and process-process 
interconnection (as in workflow for a business process) are primitive compared with 
work on information systems engineering involving, for example, databases and web 
services. 
Through GGF (Global GRID Forum) a dialogue has evolved the original GRID 
architecture to include concepts from the web services environment. OGSA (Open 
Grid Services Architecture) with attendant interfaces (OGSI) is now accepted by the 
GRID community, and OGSA/DAI (Data Access Interface) provides an interface to 
databases at a rather low level. 

In parallel with this metacomputing GRID development, an initiative started in the UK 
has developed an architecture for GRIDs that combines metacomputing (i.e. 
computation) with information systems. It is based on the argument that database 
R&D (research and development) - or more generally ISE (Information Systems 
Engineering) R&D - has not kept pace with the user expectations raised by the WWW. 
Tim Berners-Lee threw down the challenge of the semantic web and the web of trust 
[1]. The EC (European Commission) has argued for the information society, the 
knowledge society and the ERA (European Research Area) - all of which are 
dependent on database R&D in the ISE sense. This requires an open architecture 
embracing both computation and information handling, with integrated detection 
systems using instruments and with an advanced user interface providing ‘martini’ 
(anytime, anyhow, anywhere) access to the facilities. The GRIDs concept [6] 
addresses this challenge, and further elaboration by a team of experts has produced 
the EC-sponsored document ‘Next Generation GRID’ [3]. 

It is time for the database community (in the widest sense, i.e. the information 
systems engineering community) to take stock of the research challenges and plan a 
campaign to meet them with excellent solutions, not only academically or 
theoretically correct but also well-engineered for end-user acceptance and use. 



2 GRIDs 

2.1 The Idea 

In 1998-1999 the UK Research Council community was proposing future 
programmes for R&D. The author was asked to propose an integrating IT architecture 
[6]. The proposal was based on concepts including distributed computing, 
metacomputing, metadata, agent- and broker-based middleware, client-server 
migrating to three-layer and then peer-to-peer architectures and integrated 
knowledge-based assists. The novelty lay in the integration of various techniques into one 
architectural framework. 



2.2 The Requirement 

The UK Research Council community of researchers was facing several IT-based 
problems. Their ambitions for scientific discovery included post-genomic discoveries, 
climate change understanding, oceanographic studies, environmental pollution 
monitoring and modelling, precise materials science, studies of combustion processes, 
advanced engineering, pharmaceutical design, and particle physics data handling and 
simulation. They needed more processor power, more data storage capacity, better 
analysis and visualisation - all supported by easy-to-use tools controlled through an 
intuitive user interface. 

On the other hand, much of commercial IT (Information Technology) including 
process plant control, management information and decision support systems, 
IT-assisted business processes and their re-engineering, entertainment and media systems 
and diagnosis support systems all require ever-increasing computational power and 
expedited information access, ideally through a uniform system providing a seamless 
information and computation landscape to the end-user. Thus there is a large potential 
market for GRIDs systems. 

The original proposal based the academic development of the GRIDs architecture and 
facilities on challenging scientific applications, then involving IT companies as the 
middleware stabilised to produce products which in turn could be taken up by the 
commercial world. During 2000 the UK e-Science programme was elaborated, with 
funding starting in April 2001. 



2.3 Architecture Overview 

The architecture proposed consists of three layers (Fig. 1). The computation / data grid 
has supercomputers, large servers, massive data storage facilities and specialised 
devices and facilities (e.g. for VR (Virtual Reality)) all linked by high-speed 
networking and forms the lowest layer. The main functions include compute load 
sharing / algorithm partitioning, resolution of data source addresses, security, 
replication and message rerouting. This layer also provides connectivity to detectors 
and instruments. The information grid is superimposed on the computation / data grid 
and resolves homogeneous access to heterogeneous information sources mainly 
through the use of metadata and middleware. Finally, the uppermost layer is the 
knowledge grid which utilises knowledge discovery in database technology to 
generate knowledge and also allows for representation of knowledge through 
scholarly works, peer-reviewed (publications) and grey literature, the latter especially 
hyperlinked to information and data to sustain the assertions in the knowledge. 

The concept is based on the idea of a uniform landscape within the GRIDs domain, 
the complexity of which is masked by easy-to-use interfaces. 



2.4 The GRID 

In 1998 - in parallel with the initial UK thinking on GRIDs - Ian Foster and Carl 
Kesselman published a collection of papers in a book generally known as ‘The GRID 
Bible’ [4]. The essential idea is to connect together supercomputers to provide more 
power - the metacomputing technique. However, the major contribution lies in the 
systems and protocols for compute resource scheduling. Additionally, the designers of 
the GRID realised that these linked supercomputers would need fast data feeds and so 
developed GRIDFTP. Finally, basic systems for authentication and authorisation are 
described. The GRID has encompassed the use of SRB (Storage Resource Broker) 
from SDSC (San Diego Supercomputer Centre) for massive data handling. SRB has 
its proprietary metadata system to assist in locating relevant data resources. It also 
uses LDAP as its directory of resources. The GRID corresponds to the lowest grid 
layer (computation / data layer) of the GRIDs architecture. 



3 The GRIDs Architecture 

3.1 Introduction 

The idea behind GRIDs is to provide an IT environment that interacts with the user to 
determine the user requirement for service and then, having obtained the user’s 
agreement to ‘the deal’, satisfies that requirement across a heterogeneous environment 
of data stores, processing power, special facilities for display and data collection 
systems (including triggered automatic detection instruments) thus making the IT 
environment appear homogeneous to the end-user. 

Referring to Fig. 2, the major components external to the GRIDs environment are: 

a) users: each being a human or another system; 

b) sources: data, information or software; 

c) resources: such as computers, sensors, detectors, visualisation or VR (virtual 
reality) facilities. 

Each of these three major components is represented continuously and actively within 
the GRIDs environment by: 

1) metadata: which describes the external component and which is changed with 
changes in circumstances through events 

2) an agent: which acts on behalf of the external resource representing it within the 
GRIDs environment. 
Fig. 2. The GRIDs Components 

As a simple example, the agent could be regarded as the answering service of a 
person’s mobile phone and the metadata as the instructions given to the service such 
as ‘divert to service when busy’ and / or ‘divert to service if unanswered’. 

Finally there is a component which acts as a ‘go between’ between the agents. These 
are brokers which, as software components, act much in the same way as human 
brokers by arranging agreements and deals between agents, by acting themselves (or 
using other agents) to locate sources and resources, to manage data integration, to 
ensure authentication of external components and authorisation of rights to use by an 
authenticated component and to monitor the overall system. 

From this it is clear that the key components are the metadata, the agents and the 
brokers. 



3.2 Metadata 

Metadata is data about data [7]. An example might be a product tag attached to a 
product (e.g. a tag attached to a piece of clothing) that is available for sale. The 
metadata on the product tag tells the end-user (human considering purchasing the 
article of clothing) data about the article itself - such as the fibres from which it is 
made, the way it should be cleaned, its size (possibly in different classification 
schemes such as European, British, American) and maybe style, designer and other 
useful data. The metadata tag may be attached directly to the garment, or it may 
appear in a catalogue of clothing articles offered for sale (or, more usually, both). The 
metadata may be used to make a selection of potentially interesting articles of 
clothing before the actual articles are inspected, thus improving convenience. Today 
this concept is widely used. Much e-commerce is based on B2C (Business to 
Customer) transactions based on an online catalogue (metadata) of goods offered. 
One well-known example is www.amazon.com. 

Fig. 3. Metadata Classification 

What is metadata to one application may be data to another. For example, an 
electronic library catalogue card is metadata to a person searching for a book on a 
particular topic, but data to the catalogue system of the library which will be grouping 
books in various ways: by author, classification code, shelf position, title - depending 
on the purpose required. 

It is increasingly accepted that there are several kinds of metadata. The classification 
proposed (Fig. 3) is gaining wide acceptance and is detailed below. 



Schema Metadata. Schema metadata constrains the associated data. It defines the 
intension whereas instances of data are the extension. From the intension a theoretical 
universal extension can be created, constrained only by the intension. Conversely, any 
observed instance should be a subset of the theoretical extension and should obey the 
constraints defined in the intension (schema). One problem with existing schema 
metadata (e.g. schemas for relational DBMS) is that they lack certain intensional 
information that is required [8]. Systems for information retrieval based on, e.g. the 
SGML (Standard Generalised Markup Language) DTD (Document Type Definition) 
experience similar problems. 
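
The distinction between intension and extension can be illustrated with a minimal Python sketch. The schema, attribute names and constraints below are hypothetical and are not part of the GRIDs proposal; the sketch only shows that an observed instance belongs to the theoretical extension when it obeys every constraint in the intension. 

```python
from typing import Any, Callable, Dict

# Hypothetical intension: each attribute name maps to a constraint predicate.
Schema = Dict[str, Callable[[Any], bool]]

CAR_SCHEMA: Schema = {
    # Enumerated domain of valid values (illustrative only).
    "manufacturer": lambda v: v in {"Fiat", "Ford", "Volvo"},
    # Numeric value range (illustrative only).
    "engine_cc": lambda v: isinstance(v, int) and 500 <= v <= 8000,
}


def conforms(instance: Dict[str, Any], schema: Schema) -> bool:
    """Return True if the instance has exactly the schema's attributes
    and every value satisfies its constraint."""
    if set(instance) != set(schema):
        return False
    return all(check(instance[name]) for name, check in schema.items())
```

A richer schema of the kind sent along with exchanged data would simply carry more such constraints than a conventional DBMS schema records. 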

It is noticeable that many ad hoc systems for data exchange between systems send 
with the data instances a schema that is richer than that in conventional DBMS - to 
assist the software (and people) handling the exchange to utilise the exchanged data to 
best advantage. 
Navigational Metadata. Navigational metadata provides the pathway or routing to 
the data described by the schema metadata or associative metadata. In the RDF model 
it is a URL (Uniform Resource Locator), or more accurately, a URI (Uniform 
Resource Identifier). With increasing use of databases to store resources, the most 
common navigational metadata now is a URL with associated query parameters 
embedded in the string to be used by CGI (Common Gateway Interface) software or 
proprietary software for a particular DBMS product or DBMS-Webserver software 
pairing. 
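
As a small illustration of a URL carrying query parameters, the following Python sketch separates the access path from the parameters using the standard urllib.parse module. The URL, host and parameter names are invented for the example and do not refer to any real service. 

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical navigational metadata: a URL whose query string carries
# the parameters a CGI script or DBMS gateway would use.
NAV = "http://example.org/cgi-bin/catalogue?table=books&author=Codd&limit=10"


def split_navigation(url: str):
    """Separate the access path (host and script) from the query parameters."""
    parts = urlsplit(url)
    # parse_qs returns a list of values per key; keep the first of each.
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return parts.netloc, parts.path, params
```

Note that nothing in the string itself says who may follow the link or what it means, which is the gap that associative metadata is meant to fill. 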

The navigational metadata describes only the physical access path. Naturally, 
associated with a particular URI are other properties such as: 

a) security and privacy (e.g. a password required to access the target of the URI); 

b) access rights and charges (e.g. does one have to pay to access the resource at the 
URI target); 

c) constraints over traversing the hyperlink mapped by the URI (e.g. the target of the 
URI is only available if previously a field on a form has been input with a value 
between 10 and 20). Another example would be the hypermedia equivalent of 
referential integrity in a relational database; 

d) semantics describing the hyperlink such as ‘the target resource describes the son of 
the person described in the origin resource’. 

However, these properties are best described by associative metadata which then 
allows more convenient co-processing in context of metadata describing both 
resources and hyperlinks between them and - if appropriate - events. 



Associative Metadata. In the data and information domain associative metadata can 
describe: 

a) a set of data (e.g. a database, a relation (table) or a collection of documents or a 
retrieved subset). An example would be a description of a dataset collected as part of 
a scientific mission; 

b) an individual instance (record, tuple, document). An example would be a library 
catalogue record describing a book; 

c) an attribute (column in a table, field in a set of records, named element in a set of 
documents). An example would be the accuracy / precision of instances of the 
attribute in a particular scientific experiment; 

d) domain information (e.g. value range) of an attribute. An example would be the 
range of acceptable values in a numeric field such as the capacity of a car engine or 
the list of valid values in an enumerated list such as the list of names of car 
manufacturers; 

e) a record / field intersection unique value (i.e. value of one attribute in one instance). 
This would be used to explain an apparently anomalous value. 

In the relationship domain, associative metadata can describe relationships between 
sets of data e.g. hyperlinks. Associative metadata can - with more flexibility and 
expressivity than available in e.g. relational database technology or hypermedia 
document system technology - describe the semantics of a relationship, the 
constraints, the roles of the entities (objects) involved and additional constraints. 
In the process domain, associative metadata can describe (among other things) the 
functionality of the process, its external interface characteristics, restrictions on 
utilisation of the process and its performance requirements / characteristics. 

In the event domain, associative metadata can describe the event, the temporal 
constraints associated with it, the other constraints associated with it and actions 
arising from the event occurring. 

Associative metadata can also be personalised: given clear relationships between 
them that can be resolved automatically and unambiguously, different metadata 
describing the same base data may be used by different users. 

Taking an orthogonal view over these different kinds of information system objects to 
be described, associative metadata may be classified as follows: 

1) descriptive: provides additional information about the object to assist in 
understanding and using it; 

2) restrictive: provides additional information about the object to restrict access to 
authorised users and is related to security, privacy, access rights, copyright and IPR 
(Intellectual Property Rights); 

3) supportive: a separate and general information resource that can be cross-linked to 
an individual object to provide additional information e.g. translation to a different 
language, super- or sub-terms to improve a query - the kind of support provided by a 
thesaurus or domain ontology. 

Most examples of metadata in use today include some components of most of these 
kinds, but they are neither structured nor specified formally, so the metadata tends 
to be of limited use for automated operations - particularly interoperation - thus 
requiring additional human interpretation. 
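
To suggest what a formally structured record might look like, the sketch below keeps the descriptive, restrictive and supportive parts of associative metadata in separate typed fields so that software can act on each without human interpretation. The class, field names and sample values are assumptions made for illustration, not a proposed standard. 

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class AssociativeMetadata:
    """Hypothetical record separating the three kinds of associative metadata."""
    target: str
    # Descriptive: helps in understanding and using the object.
    descriptive: Dict[str, str] = field(default_factory=dict)
    # Restrictive: who is authorised to access the object.
    authorised_users: Set[str] = field(default_factory=set)
    # Supportive: thesaurus-style related terms, keyed by query term.
    thesaurus: Dict[str, Set[str]] = field(default_factory=dict)

    def may_access(self, user: str) -> bool:
        """Restrictive check: only listed users are authorised."""
        return user in self.authorised_users

    def expand_query(self, term: str) -> Set[str]:
        """Supportive use: widen a query term with related thesaurus terms."""
        return {term} | self.thesaurus.get(term, set())
```

Because each kind has its own field, an automated process can test access rights or expand a query directly, which an unstructured free-text description would not allow. 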



3.3 Agents 

Agents operate continuously and autonomously and act on behalf of the external 
component they represent. They interact with other agents via brokers, whose task it 
is to locate suitable agents for the requested purpose. An agent’s actions are 
controlled to a large extent by the associated metadata which should include either 
instructions, or constraints, such that the agent can act directly or deduce what action 
is to be taken. Each agent is waiting to be ‘woken up’ by some kind of event; on 
receipt of a message the agent interprets the message and - using the metadata as 
parametric control - executes the appropriate action, either communicating with the 
external component (user, source or resource) or with brokers as a conduit to other 
agents representing other external components. 
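
The idea of metadata acting as parametric control can be sketched in a few lines of Python, reusing the mobile phone answering service analogy from Section 3.1. The class, event names and action names are hypothetical; a real agent would also communicate with brokers and with the external component it represents. 

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Agent:
    """Represents one external component; behaviour is driven by metadata."""
    name: str
    # Hypothetical metadata: maps an event type to the action to take.
    metadata: Dict[str, str] = field(default_factory=dict)

    def on_event(self, event: str) -> str:
        """Wake up on an event and select the action named in the metadata.
        Events with no instruction are ignored."""
        return self.metadata.get(event, "ignore")


# The answering-service analogy: instructions are held as metadata.
phone = Agent(
    name="mobile-phone",
    metadata={"busy": "divert_to_service", "unanswered": "divert_to_service"},
)
```

Changing the agent's behaviour then means changing its metadata rather than its code, which is the trade-off between general and specialised agents raised later among the research challenges. 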

An agent representing an end-user accepts a request from the end-user and interacts 
with the end-user to refine the request (clarification and precision), first based on the 
user metadata and then based on the results of a first attempt to locate (via brokers 
and other agents) appropriate sources and resources to satisfy the request. The 
proposed activity within GRIDs for that request is presented to the end-user as a 
‘deal’ with any costs, restrictions on rights of use etc. Assuming the user accepts the 
offered deal, the GRIDs environment then satisfies it using appropriate resources and 
sources and finally sends the result back to the user agent where - again using 
metadata - end-user presentation is determined and executed. 

An agent representing a source will - with the associated metadata - respond to 
requests (via brokers) from other agents concerning the data or information stored, or 
the properties of the software stored. Assuming the deal with the end-user is accepted, 
the agent performs the retrieval of data requested, or supply of software requested. 

An agent representing a resource - with the associated metadata - responds to 
requests for utilisation of the resource with details of any costs, restrictions and 
relevant capabilities. Assuming the deal with the end-user is accepted the resource 
agent then schedules its contribution to providing the result to the end-user. 



3.4 Brokers 

Brokers act as ‘go betweens’ between agents. Their task is to accept messages from 
an agent which request some external component (source, resource or user), identify 
an external component that can satisfy the request by its agent working with its 
associated metadata and either put the two agents in direct contact or continue to act 
as an intermediary, possibly invoking other brokers (and possibly agents) to handle, 
for example, measurement unit conversion or textual word translation. 

Other brokers perform system monitoring functions including overseeing 
performance (and if necessary requesting more resources to contribute to the overall 
system e.g. more networking bandwidth or more compute power). They may also 
monitor usage of external components both for statistical purposes and possibly for 
any charging scheme. 



3.5 The Components Working Together 

Now let us consider how the components interact. An agent representing a user may 
request a broker to find an agent representing another external component such as a 
source or a resource. The broker will usually consult a directory service (itself 
controlled by an agent) to locate potential agents representing suitable sources or 
resources. The information will be returned to the requesting (user) agent, probably 
with recommendations as to order of preference based on criteria concerning the 
offered services. The user agent matches these against preferences expressed in the 
metadata associated with the user and makes a choice. The user agent then makes the 
appropriate recommendation to the end-user who in turn decides to ‘accept the deal’ 
or not. 
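
The interaction just described can be sketched as follows. This is a simplified illustration under stated assumptions: the directory is an in-memory list, the broker's order of preference is taken to be cost alone, and the user preference is a single hypothetical "max_cost" entry in the user metadata. None of these names come from the GRIDs proposal. 

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Offer:
    agent: str      # name of the source or resource agent
    service: str    # what that agent offers
    cost: float     # hypothetical charge for use


class Broker:
    """Locates agents through a directory and recommends an order."""

    def __init__(self, directory: List[Offer]):
        self.directory = directory

    def find(self, service: str) -> List[Offer]:
        """Return matching offers, cheapest first, as the recommended order."""
        matches = [o for o in self.directory if o.service == service]
        return sorted(matches, key=lambda o: o.cost)


def choose(offers: List[Offer], user_metadata: Dict[str, float]) -> Optional[Offer]:
    """User agent: pick the first recommended offer within the user's budget.
    Returns None when no offer is acceptable, so no deal is proposed."""
    budget = user_metadata.get("max_cost", float("inf"))
    for offer in offers:
        if offer.cost <= budget:
            return offer
    return None
```

The chosen offer corresponds to the recommendation put to the end-user, who still decides whether to accept the deal. 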
4 Ambient Computing 

The concept of ambient computing implies that the computing environment is always 
present and available in an even manner. The concept of pervasive computing implies 
that the computing environment is available everywhere and is ‘into everything’. The 
concept of mobile computing implies that the end-user device may be connected even 
when on the move. In general usage of the term, ambient computing implies both 
pervasive and mobile computing. 

The idea, then, is that an end-user may find herself connected (or connectable - she 
may choose to be disconnected) to the computing environment all the time. The 
computing environment may involve information provision (access to database and 
web facilities), office functions (calendar, email, directory), desktop functions (word 
processing, spreadsheet, presentation editor), perhaps project management software 
and systems specialised for her application needs - accessed from her end-user device 
connected back to ‘home base’ so that her view of the world is as if at her desk. In 
addition entertainment subsystems (video, audio, games) should be available. 

A typical configuration might comprise: 

a) a headset with earphone(s) and microphone for audio communication, connected 
by Bluetooth wireless local connection to 

b) a PDA (personal digital assistant) with small screen, numeric/text keyboard (like a 
telephone), GSM/GPRS (mobile phone) connections for voice and data, wireless 
LAN connectivity and ports for connecting sensor devices (to measure anything close 
to the end-user), in turn connected by Bluetooth to 

c) an optional notebook computer carried in a backpack (but taken out for use in a 
suitable environment) with conventional screen, keyboard, large hard disk and 
connectivity through GSM/GPRS, wireless LAN, cable LAN and dial-up telephone. 

The end-user would perhaps use only (a) and (b) (or maybe (b) alone using the 
built-in speaker and microphone) in a social or professional context as mobile phone and 
‘filofax’, and as entertainment centre, with or without connectivity to ‘home base’ 
servers and IT environment. For more traditional working requiring keyboard and 
screen the notebook computer would be used, probably without the PDA. The two 
might be used together with data collection validation / calibration software on the 
notebook computer and sensors attached to the PDA. 

The balance between that (data, software) which is on servers accessed over the 
network and that which is on (one of) the end-user device(s) depends on the mode of 
work, speed of required response and likelihood of interrupted connections. Clearly 
the GRIDs environment is ideal for such a user to be connected. 

Such a configuration is clearly useful for a ‘road warrior’ (travelling salesman), for 
emergency services such as firefighters or paramedics, for businessmen, for 
production industry managers, for the distribution / logistics industry (warehousing, 
transport, delivery), for scientists in the field... and also for leisure activities such as 
mountain walking, visiting an art gallery, locating a restaurant or visiting an 
archaeological site. 
5 The Challenges 

Such an IT architectural environment inevitably poses challenging research issues. 
The major ones are: 



5.1 Metadata 

Since metadata is critically important for interoperation and semantic understanding, 
there is a requirement for precise and formal representation of metadata to allow 
automated processing. Research is required into the expressivity of the metadata 
representation language needed to represent the entities user, source, resource. For example, the 
existing Dublin Core Metadata standard [2] is machine-readable but not 
machine-understandable, and furthermore mixes navigational, associative descriptive and 
associative restrictive metadata. A formal version has been proposed [Je99]. 



5.2 Agents 

There is an interesting research area concerning the generality or specificity of agents. 
Agents could be specialised for a particular task or generalised and configured 
dynamically for the task by metadata. Furthermore, agents may well need to be 
reactive and dynamically reconfigured by events / messages. This would cause a 
designer to lean towards general agents with dynamic configuration, but there are 
performance, reliability and security issues. In addition there are research issues 
concerning the syntax and semantics of messages passed between agents and brokers 
to ensure optimal representation with appropriate performance and security. 



5.3 Brokers 

A similar research question is posed for brokers - are they generalised and dynamic 
or specific? However, brokers do not just have representational functions; they also have
to negotiate. The degree of autonomy becomes the key research issue: can the broker
decide by itself or does it solicit input from the external entity (user, source, resource) 
via its agent and metadata? The broker will need general strategic knowledge 
(negotiation techniques) but the way a broker uses the additional information supplied 
by the agents representing the entities could be a differentiating factor and therefore a 
potential business benefit. In addition there are research issues concerning the syntax 
and semantics of messages passed between brokers to ensure optimal representation 
with appropriate performance and security. 



5.4 Security 

Security is an issue in any system, and particularly in a distributed system. It becomes 
even more important if the system is a common marketplace with great heterogeneity 
of purpose and intent. Security takes the following forms:

a) prevention of unauthorised access: this requires authentication of the user,
authorisation of the user to access or use a source or resource and provision or denial 
of that access. The current heterogeneity of authentication and authorisation 
mechanisms provides many opportunities for deliberate or unwitting security 
exposure; 

b) ensuring availability of the source or resource: this requires techniques such as 
replication, mirroring and hot or warm failover. There are deep research issues in 
transactions and rollback/recovery and optimisation; 

c) ensuring continuity of service: this relates to (b) but includes additional fallback 
procedures and facilities and there are research issues concerning the optimal (cost- 
effective) assurance of continuity. 

In the case of interrupted communication there is a requirement for synchronisation of 
the end-user’s view of the system between that which is required on the PDA and / or 
laptop and the servers. 

There are particular problems with wireless communications because of interception. 
Encryption of sensitive transmissions is available but there remain research issues 
concerning security assurance. 



5.5 Privacy 

The privacy issues concern essentially the tradeoff of personal information provision 
for intelligent system reaction. There are research issues on the optimal balance for 
particular end-user requirements. Furthermore, data protection legislation in countries 
varies and there are research issues concerning the requirement to provide data or to 
conceal data. 



5.6 Trust 

When any end-user purchases online (e.g. a book from www.amazon.com) there is a
trust that the supplier will deliver the goods and that the purchaser’s credit card 
information is valid. This concept requires much extension in the case of contracts for 
supply of engineered components for assembly into e.g. a car. The provision of an e- 
marketplace brings with it the need for e-tendering, e-contracts, e-payments, e- 
guarantees as well as opportunities to re-engineer the business process for 
effectiveness and efficiency. This is currently a very hot research topic since it 
requires the representation in an IT system of artefacts (documents) associated with 
business transactions. 



5.7 Interoperability 

There is a clear need to provide the end-user with homogeneous access to 
heterogeneous information sources. This involves schema reconciliation / mapping and
associated transformations. Associated with this topic are requirements for languages 
that are more representative (of the entities / objects in the real world) and more 
expressive (in expressing the transformations or operations). Recent R&D [10], [9] has
indicated that graphs provide a neutral basis for the syntax with added value in graph 
properties such that structural properties may be used. 
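
As a small illustration of the schema reconciliation described above, the sketch below maps records from two differently structured sources onto one target schema. The source names, field names and transformations are hypothetical, and plain dictionaries stand in for the graph-based representations of [10], [9]; it is meant only to show the shape of a mapping with its associated transformations.

```python
# Hypothetical mappings: target field -> (source field, transformation).
SOURCE_A = {"title": ("Title", str.strip), "year": ("PubYear", int)}
SOURCE_B = {"title": ("name", str.strip), "year": ("date", lambda d: int(d[:4]))}


def to_target(record, mapping):
    """Rewrite one source record into the common target schema."""
    return {
        target: transform(record[source_field])
        for target, (source_field, transform) in mapping.items()
    }


# Two heterogeneous records describing the same item.
a = to_target({"Title": " GRIDs ", "PubYear": "2004"}, SOURCE_A)
b = to_target({"name": "GRIDs", "date": "2004-03-14"}, SOURCE_B)
```

After mapping, both records share the same field names and value types, which is what gives the end-user homogeneous access.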



5.8 Data Quality 

The purpose of data, especially when structured in context as information, is to 
represent the world of interest. There are real research issues in ensuring this is true - 
especially when the data is incomplete or uncertain, when the data is subject to certain 
precision, accuracy and associated calibration constraints or when only by knowing 
its provenance can a user utilise it confidently. 



5.9 Performance 

The architecture opens the possibility of generating optimal execution plans from
knowledge of the characteristics of data / information, software and processing power
on each node. Refinements involve data movement (expensive if the volumes are
large) or program code movement (security implications) to appropriate nodes. 
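
The trade-off between data movement and code movement can be sketched as a simple cost comparison. Everything in the sketch is an assumption made for illustration: the cost model (bytes divided by bandwidth), the fixed vetting cost that stands in for the security implications of shipping code, and all names.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    accepts_foreign_code: bool  # assumed security policy of the node


def choose_plan(data_bytes, code_bytes, bandwidth, data_node, vetting_seconds=5.0):
    """Return (strategy, estimated_seconds) for one operation.

    bandwidth is in bytes per second; vetting_seconds is an assumed fixed
    cost for checking shipped code before it may run on the data node.
    """
    move_data = data_bytes / bandwidth
    if not data_node.accepts_foreign_code:
        # Code may not be shipped to this node, so the data must travel.
        return ("move_data", move_data)
    move_code = code_bytes / bandwidth + vetting_seconds
    if move_code < move_data:
        return ("move_code", move_code)
    return ("move_data", move_data)
```

With a large data volume and a small program, the sketch prefers moving the code unless the node holding the data refuses foreign code.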



6 Conclusion 

The GRIDs architecture will provide an IT infrastructure to revolutionise and expedite 
the way in which we do business and achieve leisure. The Ambient Computing 
architecture will revolutionise the way in which the IT infrastructure intersects with 
our lives, both professional and social. The two architectures in combination will 
provide the springboard for the greatest advances yet in Information Technology. This 
can only be achieved by excellent R&D leading to commercial take-up and 
development of suitable products, to agreed standards, ideally within an environment 
such as W3C (the World Wide Web Consortium). The current efforts in GRID 
computing have moved some way away from metacomputing and towards the 
architecture described here with the adoption of OGSA (Open Grid Services 
Architecture). However, there is a general feeling that Next Generation GRID 
requires an architecture rather like that described here, as reported in the Report of the 
EC Expert Group on the subject [3], 
Abstract. A semantic web can be thought of as a web that is highly intelligent
and sophisticated, and one that needs little or no human intervention to carry out
tasks such as scheduling appointments, coordinating activities, searching for
complex documents as well as integrating disparate
databases and information systems. While much progress has been made 
toward developing such an intelligent web, there is still a lot to be done. 
For example, there is little work on security and privacy for the semantic 
web. However, before we examine security for the semantic web we need 
to ensure that its key components, such as web databases and services, 
are secure. This paper will mainly focus on security and privacy issues for 
web databases and services. Finally, some directions toward developing 
a secure semantic web will be provided. 



1 Introduction 

Recent developments in information systems technologies have resulted in com- 
puterizing many applications in various business areas. Data has become a criti- 
cal resource in many organizations, and, therefore, efficient access to data, shar- 
ing the data, extracting information from the data, and making use of the in- 
formation has become an urgent need. As a result, there have been many efforts 
on not only integrating the various data sources scattered across several sites, 
but also on extracting information from these databases in the form of patterns 
and trends. These data sources may be databases managed by Database Man- 
agement Systems (DBMSs), or they could be data warehoused in a repository 
from multiple data sources. The advent of the World Wide Web (WWW) in the 
mid 1990s has resulted in even greater demand for managing data, information, 
and knowledge effectively. There is now so much data on the web that managing
it with conventional tools is becoming almost impossible. As a result, to
provide interoperability as well as warehousing between multiple data sources 
and systems, and to extract information from the databases and warehouses on 
the web, various tools are being developed. 

As the demand for data and information management increases, there is also 
a critical need for maintaining the security of the databases, applications, and 
information systems. Data and information have to be protected from unautho- 
rized access as well as from malicious corruption. With the advent of the web 
it is even more important to protect the data and information as numerous
individuals now have access to them. Therefore, we need effective mechanisms for
securing data and applications. The web is now evolving into the semantic web.
The semantic web is about ensuring that web pages can be read and understood by
machines. The major components of the semantic web include web infrastructures,
web databases and services, and ontology management and information
integration. There has been a lot of work in each of these three areas. However,
very little work has been devoted to security. If the semantic web is to be
effective, we need to ensure that the information on the web is protected from
unauthorized accesses and malicious modifications. We also need to ensure that
individuals' privacy is maintained. This paper focuses on security and privacy
related to one of the components of the semantic web, that is, web databases
and services.

The organization of this paper is as follows. In Section 2 we give some back- 
ground information on web databases and services. Security and privacy for 
web databases will be discussed in Section 3, whereas security and privacy for 
web services will be discussed in Section 4. Some issues on developing a secure 
semantic web will be discussed in Section 5. The paper is concluded in Section 6. 

2 Background on Web Databases and Services 

This paper focuses on security and privacy for web databases and services and 
therefore in this section we provide some background information about them. 



2.1 Web Data Management 

A major challenge for web data management is coming up with an appropriate 
data representation scheme. The question is: is there a need for a standard data 
model? Is it at all possible to develop such a standard? If so, what are the re- 
lationships between the standard model and the individual models used by the 
databases on the web? The significant development for web data modeling came 
in the latter part of 1996 when the World Wide Web Consortium (W3C) [15] was 
formed. This group felt that web data modeling was an important area and be- 
gan addressing the data modeling aspects. Then, sometime around 1997 interest 
in XML (Extensible Markup Language) began. This was an effort of the W3C. 
XML is not a data model. It is a metalanguage for representing documents. The 
idea is that if documents are represented using XML then these documents can 
be uniformly represented and therefore exchanged on the web. Database man- 
agement functions for the web include those such as query processing, metadata 
management, security, and integrity. Querying and browsing are two of the key 
functions. First of all, an appropriate query language is needed. Since SQL is 
a popular language, appropriate extensions to SQL may be desired. XML-QL 
and XQuery [15] are moving in this direction. Query processing involves devel- 
oping a cost model. Are there special cost models for Internet database man- 
agement? With respect to browsing operations, the query processing techniques 
have to be integrated with techniques for following links. That is, hypermedia
technology has to be integrated with database management technology. Transaction
management is essential for many applications. There may be new kinds of
transactions for web data management. For example, various items may be sold 
through the Internet. In this case, the item should not be locked immediately 
when a potential buyer makes a bid. It has to be left open until several bids are 
received and the item is sold. That is, special transaction models are needed. 
Appropriate concurrency control and recovery techniques have to be developed 
for the transaction models. Metadata management is also a major concern. The 
question is, what is metadata? Metadata describes all of the information per- 
taining to a data source. This could include the various web sites, the types of 
users, access control issues, and policies enforced. Where should the metadata 
be located? Should each participating site maintain its own metadata? Should 
the metadata be replicated or should there be a centralized metadata reposi- 
tory? Storage management for Internet database access is a complex function. 
Appropriate index strategies and access methods for handling multimedia data 
are needed. In addition, due to the large volumes of data, techniques for inte- 
grating database management technology with mass storage technology are also 
needed. Maintaining the integrity of the data is critical. Since the data may 
originate from multiple sources around the world, it will be difficult to keep tabs 
on the accuracy of the data. Appropriate data quality maintenance techniques
thus need to be developed. Other data management functions include integrating
heterogeneous databases, managing multimedia data, and mining. Security and
privacy are a major challenge. This is one of the main focus areas for this paper
and will be discussed in Section 3. 

2.2 Web Services 

A web service can be defined as an autonomous unit of application logic that
provides either some business functionality or information to other
applications through an Internet connection. Web services are based on a set of
XML standards, namely, the Simple Object Access Protocol (SOAP) [15]
to expose the service functionalities, the Web Services Description Language
(WSDL) [15] to provide an XML-based description of the service interface,
and the Universal Description, Discovery and Integration (UDDI) [16] to
publish information regarding the web service and thus make this information
available to potential clients. UDDI provides an XML-based structured and
standard description of web service functionalities, as well as searching facilities
to help in finding the provider(s) that best fit the client requirements. More
precisely, a UDDI registry is a collection of entries, each of which provides
information on a specific web service. Each entry is in turn composed of five
main data structures, businessEntity, businessService, bindingTemplate,
publisherAssertion, and tModel, which provide different information on the
web service. For instance, the businessEntity data structure provides overall
information about the organization providing the web service, whereas the
businessService data structure provides a technical description of the service.
Searching facilities provided by UDDI registries are of two different types, which
result in two different types of inquiries that can be submitted to a UDDI
registry: drill-down pattern inquiries (i.e., get_xxx API functions), which return
a whole core data structure (e.g., bindingTemplate, businessEntity), and
browse pattern inquiries (i.e., find_xxx API functions), which return overview
information about the registered data.
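
The difference between the two inquiry patterns can be sketched with a toy in-memory registry. This is not the UDDI API: the classes, fields and method names below are simplified assumptions that only mimic the find_xxx (browse) and get_xxx (drill-down) styles of inquiry, and the access point URL is a placeholder.

```python
from dataclasses import dataclass, field


@dataclass
class BusinessService:
    service_key: str
    name: str
    access_point: str  # simplified stand-in for a bindingTemplate


@dataclass
class BusinessEntity:
    business_key: str
    name: str
    services: list = field(default_factory=list)


class Registry:
    """Toy in-memory registry; not the UDDI API itself."""

    def __init__(self):
        self._entities = {}

    def publish(self, entity):
        self._entities[entity.business_key] = entity

    def find_business(self, name_prefix):
        """Browse pattern: return overview information (key and name) only."""
        prefix = name_prefix.lower()
        return [
            (e.business_key, e.name)
            for e in self._entities.values()
            if e.name.lower().startswith(prefix)
        ]

    def get_business_detail(self, business_key):
        """Drill-down pattern: return the whole data structure."""
        return self._entities[business_key]


registry = Registry()
registry.publish(
    BusinessEntity(
        "b1",
        "Acme Travel",
        [BusinessService("s1", "Flight booking", "http://example.org/booking")],
    )
)
overview = registry.find_business("acme")                # browse first
detail = registry.get_business_detail(overview[0][0])    # then drill down
```

A client would typically browse to locate candidate providers and then drill down on a key taken from the overview.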

As far as architectural aspects are concerned, three main entities
compose the Web Service Architecture (WSA): the service provider, which is
the person or organization that provides the web service, the service requestor, 
which is a person or organization that wishes to make use of the services offered 
by a provider for achieving its business requirements, and the discovery agency, 
which manages UDDI registries. UDDI registries can be implemented according 
to either a third-party or a two-party architecture, with the main difference that 
in a two-party architecture there is no distinction between the service provider 
and the discovery agency, whereas in a third-party architecture the discovery 
agency and the service provider are two separate entities. It is important to note 
that today third-party architectures are becoming more and more widely used 
for any web-based system, due to their scalability and the ease with which they 
are able to manage large amounts of data and large collections of users.

3 Security and Privacy for Web Databases 

Security issues for web databases include secure management of structured 
databases as well as unstructured and semistructured databases, and privacy 
issues. In the following sections we discuss all these aspects. 

3.1 Security for Structured Databases on the Web 

A lot of research has been done on developing access control models for relational
and object-oriented DBMSs [6]. For example, today most commercial DBMSs
rely on the System R access control model. However, the web introduces
new challenges. For instance, a key issue is that the population accessing
web databases is larger and more dynamic than the one accessing
conventional DBMSs. This implies that traditional identity-based mechanisms
for performing access control are not enough. Rather, a more flexible way of
qualifying subjects is needed, for instance based on the notion of role or credential.
Next we need to examine the security impact on all of the web data management 
functions. These include query processing, transaction management, index and 
storage management, and metadata management. For example, query process- 
ing algorithms may need to take into consideration the access control policies. 
We also need to examine the trust that must be placed in the modules of the 
query processor. Transaction management algorithms may also need to consider 
the security policies. For example, the transaction will have to ensure that the 
integrity as well as security constraints are satisfied. We need to examine the 
security impact in various indexing and storage strategies. For example, how 
do we store the databases on the web in a way that eases the enforcement of security
policies? Metadata includes not only information about the resources, which
include databases and services, but also security policies. We need efficient
metadata management techniques for the web, and we need to use metadata to
enhance security.
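
To make the idea of qualifying subjects by role or credential concrete, the following sketch filters query results according to the credentials a subject presents rather than the subject's identity. The policy format, table names and credential names are hypothetical; a real query processor would more likely fold such checks into query rewriting than filter rows afterwards.

```python
def satisfies(credentials, required):
    """True if the subject holds every credential the policy requires."""
    return all(credentials.get(k) == v for k, v in required.items())


def authorized_rows(rows, policies, credentials):
    """Keep only the rows whose table policy the subject's credentials satisfy."""
    return [r for r in rows if satisfies(credentials, policies[r["table"]])]


# Hypothetical policies: table name -> credentials required to read it.
POLICIES = {
    "catalogue": {},  # no credential needed
    "orders": {"role": "clerk"},
    "salaries": {"role": "manager", "department": "hr"},
}
ROWS = [
    {"table": "catalogue", "value": "book"},
    {"table": "orders", "value": "order-17"},
    {"table": "salaries", "value": "50000"},
]

clerk_view = authorized_rows(ROWS, POLICIES, {"role": "clerk"})
```

Because the decision depends only on credentials, a previously unknown subject can be served without first being registered as an individual user.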

3.2 Security for XML, RDF, and Ontology Databases 

As we evolve the web into the semantic web, we need the capability to manage 
XML and RDF databases. This means that we need to ensure secure access to 
these databases. 

Various research efforts have been reported for securing XML documents and 
XML databases [11]. Here, we briefly discuss some of the key points. XML doc- 
uments have graph structures. The main challenge is thus to develop an access 
control model which exploits this graph structure in the specification of policies 
and which is able to support a wide spectrum of access granularity levels, rang- 
ing from sets of documents, to single documents, to specific portions within a 
document, as well as the possibility of specifying both content-dependent and 
content-independent access control policies. A proposal in this direction is the 
access control model developed in the framework of the Author-X project [5],
which supports both access control and dissemination policies.
Policies are specified in XML and contain information about which subjects
can access which portions of the documents. Subjects are qualified by means of 
credentials, specified using XML. In [5] algorithms for access control as well as 
computing views of the results are also presented. In addition, architectures for 
securing XML documents are also discussed. In [3] the authors go further and 
describe how XML documents may be securely published on the web. The idea 
is for owners to publish documents, subjects to request access to the documents, 
and untrusted publishers to give the subjects the views of the documents they 
are authorized to see, making at the same time the subjects able to verify the 
authenticity and completeness of the received answer. 
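
The notion of computing the view of a document that a subject is authorized to see can be sketched as follows. This is not the algorithm of [5]: the policy here is a deliberately simple, hypothetical mapping from element tags to a required credential, and the sketch ignores content-dependent policies, dissemination and the verification of authenticity and completeness.

```python
import xml.etree.ElementTree as ET


def compute_view(xml_text, policy, credentials):
    """Return the document with every element the subject may not see removed.

    policy maps an element tag to the credential needed to read it;
    tags absent from the policy are readable by everyone (an assumption).
    """
    root = ET.fromstring(xml_text)

    def prune(elem):
        # Iterate over a copy because children are removed during the loop.
        for child in list(elem):
            required = policy.get(child.tag)
            if required is not None and required not in credentials:
                elem.remove(child)
            else:
                prune(child)

    prune(root)
    return ET.tostring(root, encoding="unicode")


DOC = "<record><name>Ann</name><diagnosis>flu</diagnosis></record>"
POLICY = {"diagnosis": "physician"}

nurse_view = compute_view(DOC, POLICY, {"nurse"})
physician_view = compute_view(DOC, POLICY, {"physician"})
```

The same document thus yields different views for different subjects, at the granularity of a portion within a document.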

The W3C [15] is also specifying standards for XML security. The XML security
project is focusing on providing the implementation of security standards for
XML. The focus is on XML-Signature Syntax and Processing, XML-Encryption
Syntax and Processing, and XML Key Management. While the standards are
focusing on what can be implemented in the near term, a lot of research is needed
on securing XML documents. The work reported in [5] is a good start.

Berners-Lee, who coined the term semantic web (see [2]), has stressed that
the key to developing a semantic web is efficiently managing RDF documents. 
That is, RDF is fundamental to the semantic web. While XML is limited in 
providing machine understandable documents, RDF handles this limitation. As 
a result, RDF provides better support for interoperability as well as searching 
and cataloging. It also describes contents of documents as well as relationships 
between various entities in the document. While XML provides syntax and no- 
tations, RDF supplements this by providing semantic information in a standard- 
ized way. Now to make the semantic web secure, we need to ensure that RDF 







E. Ferrari and B. Thuraisingham 



documents are secure. This would involve securing XML from a syntactic point 
of view. However with RDF we also need to ensure that security is preserved at 
the semantic level. The issues include the security implications of the concepts 
resource, properties and statements that are part of the RDF specification. That 
is, how is access control ensured? How can one provide access control at a fine 
granularity level? What are the security properties of the container model? How 
can bags, lists and alternatives be protected? Can we specify security policies 
in RDF? How can we solve semantic inconsistencies for the policies? How can 
we express security constraints in RDF? What are the security implications of 
statements about statements? How can we protect RDF schemas? These are dif- 
ficult questions and we need to start research to provide answers. XML security 
is just the beginning. Securing RDF is much more challenging. 

Another aspect of web data management is managing ontology databases. 
Now, ontologies may be expressed in RDF and related languages. Therefore, the 
issues for securing ontologies may be similar to securing RDF documents. That 
is, access to the ontologies may depend on the roles of the user, and/or on the 
credentials he or she may possess. On the other hand, one could use ontologies 
to specify security policies. That is, ontologies may help in securing the semantic 
web. We need more research in this area. 



3.3 Privacy for Web Databases 

Privacy is about protecting information about individuals. Privacy has been 
discussed a great deal in the past especially when it relates to protecting medical 
information about patients. Social scientists as well as technologists have been 
working on privacy issues. However, privacy has received enormous attention 
during the past year. This is mainly because of the advent of the web and 
now the semantic web, counter-terrorism and national security. For example, 
in order to extract information from databases about various individuals and 
perhaps prevent and/or detect potential terrorist attacks, data mining tools are 
being examined. We have heard a lot about national security vs. privacy in the 
media. This is mainly due to the fact that people are now realizing that to handle 
terrorism, the government may need to collect data about individuals and mine 
the data to extract information. This is causing a major concern with various 
civil liberties unions. In this section, we discuss privacy threats that arise due to 
data mining and the semantic web. We also discuss some solutions and provide 
directions for standards. 



Data mining, national security, privacy and web databases. With the 
web there is now an abundance of data and information about individuals that one
can obtain within seconds. The data could be structured data or could be mul- 
timedia data. Information could be obtained through mining or just from in- 
formation retrieval. Data mining is an important tool in making the web more 
intelligent. That is, data mining may be used to mine the data on the web so 
that the web can evolve into the semantic web. However, this also means that 
there may be threats to privacy (see [12]). Therefore, one needs to enforce privacy
controls on databases and data mining tools on the semantic web. This is a
very difficult problem. In summary, one needs to develop techniques to prevent
users from mining and extracting information from data, whether the data are on
the web or on networked servers. Note that data mining is a technology that is
critical for, say, analysts so that they can extract patterns previously unknown.
However, we do not want the information to be used in an incorrect manner. 
For example, based on information about a person, an insurance company could 
deny insurance or a loan agency could deny loans. In many cases these denials 
may not be legitimate. Therefore, information providers have to be very careful 
in what they release. Also, data mining researchers have to ensure that privacy 
aspects are addressed. While little work has been reported on privacy issues for 
web databases we are moving in the right direction. As research initiatives are 
started in this area, we can expect some progress to be made. Note that there 
are also social and political aspects to consider. That is, technologists, sociolo- 
gists, policy experts, counter-terrorism experts, and legal experts have to work 
together to develop appropriate data mining techniques as well as ensure pri- 
vacy. Privacy policies and standards are also urgently needed. That is, while the 
technologists develop privacy solutions, we need the policy makers to work with 
standards organizations (e.g., W3C) so that appropriate privacy standards are
developed. 



Solutions to the privacy problem for web databases. As we have men- 
tioned, the challenge is to provide solutions to enhance national security as well 
as extract useful information but at the same time ensure privacy. There is now 
research at various laboratories on privacy enhanced/sensitive data mining (e.g., 
Agrawal at IBM Almaden, Gehrke at Cornell University and Clifton at Purdue 
University, see for example [1], [7], [8]). The idea here is to continue with mining 
but at the same time ensure privacy as much as possible. For example, Clifton 
has proposed the use of the multiparty security policy approach for carrying out 
privacy sensitive data mining. While there is some progress we still have a long 
way to go. Some useful references are provided in [7]. We give some more details 
on an approach we are proposing. Note that one mines the data and extracts 
patterns and trends. The idea is that privacy constraints determine which pat- 
terns are private and to what extent. For example, suppose one could extract the 
names and healthcare records. If we have a privacy constraint that states that 
names and healthcare records are private then this information is not released 
to the general public. If the information is semi-private, then it is released to 
those who have a need to know. Essentially, the inference controller approach we 
have proposed in [14] is one solution to achieve some level of privacy. It could be
regarded as a type of privacy sensitive data mining. In our research we have
found many challenges to the inference controller approach. These challenges 
will have to be addressed when handling privacy constraints (see also [13]). For 
example, there are data mining tools on the web that mine web databases. The 
privacy controller should ensure privacy preserving data mining. Ontologies may 
be used by the privacy controllers. For example, there may be ontology specifications
for privacy constructs. Furthermore, XML may be extended to include
privacy constraints. RDF may incorporate privacy semantics. We need to carry 
out more research on the role of ontologies for privacy control. Much of the work 
on privacy preserving data mining focuses on relational data. We need to carry 
out research on privacy preserving mining of web data, which includes unstructured
data. We need to combine techniques for privacy preserving data mining with 
techniques for web data mining to obtain solutions for privacy preserving web 
data mining. 
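
The way privacy constraints determine which mined patterns are released can be sketched as a small filter. This is not the inference controller of [14]: the constraint format, attribute names and the need_to_know flag are assumptions made for illustration, and the sketch assumes that private patterns are withheld from every requestor.

```python
PRIVATE, SEMI_PRIVATE, PUBLIC = "private", "semi-private", "public"


def classify(pattern_attrs, constraints):
    """Return the strictest privacy level of any constraint the pattern covers."""
    level = PUBLIC
    for attrs, constraint_level in constraints:
        if attrs <= pattern_attrs:  # pattern reveals all constrained attributes
            if constraint_level == PRIVATE:
                return PRIVATE
            level = SEMI_PRIVATE
    return level


def release(patterns, constraints, need_to_know=False):
    """Release public patterns, plus semi-private ones to need-to-know requestors."""
    released = []
    for pattern in patterns:
        level = classify(frozenset(pattern), constraints)
        if level == PUBLIC or (level == SEMI_PRIVATE and need_to_know):
            released.append(pattern)
    return released


# Hypothetical constraints over sets of attributes appearing together.
CONSTRAINTS = [
    (frozenset({"name", "healthcare_record"}), PRIVATE),
    (frozenset({"name", "zip"}), SEMI_PRIVATE),
]
PATTERNS = [("name", "healthcare_record"), ("name", "zip"), ("zip", "purchase")]

public_release = release(PATTERNS, CONSTRAINTS)
restricted_release = release(PATTERNS, CONSTRAINTS, need_to_know=True)
```

Mining proceeds as usual; only the release step consults the constraints, which matches the idea of continuing with mining while ensuring privacy as far as possible.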



4 Security and Privacy for Web Services 

Security and privacy concerns related to web services are today receiving growing
attention from both industry and the research community [9]. Although most
of the security and privacy concerns are similar to those of many web-based
applications, one distinguishing feature of the Web Service Architecture is that
it relies on a repository of information, i.e., the UDDI registry, which can be
queried by service requestors and populated by service providers. Although
UDDI was initially conceived mainly as a public registry without specific
facilities for security and privacy, today security and privacy issues are becoming
more and more crucial, because data published in UDDI registries
may be highly strategic and sensitive. For instance, a service provider may not
want the information about its web services to be accessible to everyone, or
a service requestor may want to validate the privacy policy of the discovery
agency before interacting with this entity. In the following, we thus mainly focus
on security and privacy issues related to UDDI registry management. We start
by considering security issues, then we deal with privacy.



4.1 Security for Web Services 

When dealing with security, there are three main issues that need to be faced:
authenticity, integrity, and confidentiality. In the framework of UDDI, the
authenticity property mainly means that the service requestor is assured that the
information it receives from the UDDI comes from the source it claims to be
from. Ensuring integrity means ensuring that the information is not altered
during its transmission from the source to the intended recipients and that data
are modified only according to the specified access control policies. Finally,
confidentiality means that information in the UDDI registry can only be disclosed to
requestors authorized according to some specified access control policies. If a 
two-party architecture is adopted, security properties can be ensured using the 
strategies adopted in conventional DBMSs [6], since the owner of the informa- 
tion (i.e., the service provider) is also responsible for managing the UDDI. By 
contrast, such standard mechanisms must be revised when a third-party 
architecture is adopted. The main issue there is how the provider of the services can 
ensure security properties for its data, even if the data are managed by a 
discovery agency. The most intuitive solution is that of requiring the discovery agency 
to be trusted with respect to the considered security properties. However, the 
main drawback of this solution is that large web-based systems cannot be easily 
verified to be trusted and can be easily penetrated. The challenge is then how 
such security properties can be ensured without requiring the discovery agency 
to be trusted. 

In the following, we discuss each of the above-mentioned security properties 
in the context of both a two-party and a third-party architecture. 



Integrity and confidentiality. If UDDI registries are managed according to 
a two-party architecture, integrity and confidentiality can be ensured using the 
standard mechanisms adopted by conventional DBMSs [6]. In particular, an 
access control mechanism can be used to ensure that UDDI entries are accessed 
and modified only according to the specified access control policies. Basically, 
an access control mechanism is a software module that filters data accesses on 
the basis of a set of access control policies. Only the accesses authorized by the 
specified policies are granted. Additionally, data can be protected during their 
transmission from the data server to the requestor using standard encryption 
techniques [10]. 

If a third-party architecture is adopted, the access control mechanism must 
reside at the discovery agency site. However, the drawback of this solution is 
that the discovery agency must be trusted. An alternative approach to relax this 
assumption is that of using a technique similar to the one proposed in [5] for the 
secure broadcasting of XML documents. Basically, the idea is that the service 
provider encrypts the entries to be published in a UDDI registry according to its 
access control policies: all the entry portions to which the same policies apply are 
encrypted with the same key. Then, it publishes the encrypted copy of the entries 
to the UDDI. Additionally, the service provider is responsible for distributing 
keys to the service requestors in such a way that each service requestor receives 
all and only the keys corresponding to the information it is entitled to access. 
However, exploiting such a solution requires the ability to query encrypted data. 
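
The key-assignment step of this scheme can be illustrated with a minimal sketch. It assumes that each entry portion is labeled with the set of policies that apply to it, and that a requestor is entitled to a portion when it satisfies at least one of those policies. Both assumptions, as well as the names assign_keys and keys_for_requestor, belong to this example and not to [5]. The encryption of the portions and the querying of encrypted data are not shown.

```python
import secrets


def assign_keys(portion_policies):
    """Give every distinct policy set one fresh key.

    portion_policies maps a portion name to the policies that apply to it.
    Returns (portion -> key id, key id -> key bytes, policy set -> key id).
    """
    group_key_id = {}
    keys = {}
    portion_key_id = {}
    for portion, policies in portion_policies.items():
        group = frozenset(policies)
        if group not in group_key_id:
            key_id = f"k{len(group_key_id)}"
            group_key_id[group] = key_id
            keys[key_id] = secrets.token_bytes(16)  # placeholder symmetric key
        portion_key_id[portion] = group_key_id[group]
    return portion_key_id, keys, group_key_id


def keys_for_requestor(satisfied_policies, group_key_id):
    """Key ids of every policy set that shares a policy with the requestor."""
    satisfied = set(satisfied_policies)
    return {kid for group, kid in group_key_id.items() if group & satisfied}


# Hypothetical entry: portion names and policy labels are illustrative only.
entry = {
    "name": ["P1"],
    "contact": ["P1"],
    "binding": ["P1", "P2"],
    "price": ["P3"],
}
portion_key_id, keys, group_key_id = assign_keys(entry)
print(portion_key_id)
print(keys_for_requestor({"P2"}, group_key_id))
```

Portions governed by the same policies share one key, so the number of keys grows with the number of distinct policy sets and not with the number of portions.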



Authenticity. The standard approach for ensuring authenticity is using digi- 
tal signature techniques [10]. To cope with authenticity requirements, the latest 
UDDI specifications allow one to optionally sign some of the elements in a reg- 
istry, according to the W3C XML Signature syntax [15]. This technique can be 
successfully employed in a two-party architecture. However, it does not fit well 
in the third-party model, if we do not want to require the discovery agency to be 
trusted with respect to authenticity. In such a scenario, it is not possible to directly apply 
standard digital signature techniques, since a service requestor may require only 
selected portions of an entry, depending on its needs, or a combination of infor- 
mation residing in different data structures. Additionally, some portions of the 
requested information may not be delivered to the requestor because of access 
constraints stated by the specified policies. A solution that can be exploited 
in this context (proposed in [4]) is that of applying to UDDI 
entries the authentication mechanism provided by Merkle hash trees. The ap- 
proach requires that the service provider sends the discovery agency a summary 
signature, generated using a technique based on Merkle hash trees, for each en- 
try it is entitled to manage. When a service requestor queries the UDDI registry, 
the discovery agency sends it, besides the query result, also the signatures of 
the entries on which the enquiry is performed. In this way, the requestor can 
locally recompute the same hash value signed by the service provider, and by 
comparing the two values it can verify whether the discovery agency has altered 
the content of the query answer and can thus verify its authenticity. However, 
since a requestor may be returned only selected portions of an entry, it may not 
be able to recompute the summary signature, which is based on the whole entry. 
For this reason, the discovery agency sends the requestor a set of additional hash 
values, referring to the missing portions, that make it able to locally perform 
the computation of the summary signature. We refer the interested readers to 
[4] for the details of the approach. 
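
The verification step can be sketched as follows. This is a simplified illustration and not the scheme of [4]: it assumes that an entry is a flat list of portions, it uses SHA-256 from the Python standard library, and the agency supplies one hash per withheld portion instead of hashes of whole subtrees. The provider's digital signature over the root is omitted; the variable signed_root stands for a root value whose signature the requestor has already checked. All function names are hypothetical.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(leaf_hashes):
    """Hash pairs of nodes level by level; an unpaired node moves up unchanged."""
    level = list(leaf_hashes)
    if not level:
        raise ValueError("an entry needs at least one portion")
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                nxt.append(h(level[i] + level[i + 1]))
            else:
                nxt.append(level[i])
        level = nxt
    return level[0]


def provider_summary(portions):
    """Root hash that the service provider would sign and send to the agency."""
    return merkle_root([h(p.encode()) for p in portions])


def agency_answer(portions, visible):
    """Visible portions in clear, plus the hashes of the withheld ones."""
    shown = {i: portions[i] for i in visible}
    hidden = {i: h(p.encode()) for i, p in enumerate(portions) if i not in visible}
    return shown, hidden


def requestor_verify(shown, hidden, n, signed_root):
    """Recompute the root from what was received and compare."""
    leaves = [h(shown[i].encode()) if i in shown else hidden[i] for i in range(n)]
    return merkle_root(leaves) == signed_root


# Hypothetical entry content.
entry = ["businessName=Acme", "contact=a@example.org", "serviceKey=42", "category=print"]
signed_root = provider_summary(entry)
shown, hidden = agency_answer(entry, {0, 2})
print(requestor_verify(shown, hidden, len(entry), signed_root))
```

The requestor never sees the withheld portions, only their hashes, yet any change to a visible portion changes the recomputed root.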

4.2 Privacy for Web Services 

To enable privacy protection for web services consumers across multiple domains 
and services, the World Wide Web Consortium working draft Web Services 
Architecture Requirements has already defined some specific privacy 
requirements for web services [15]. In particular, the working draft specifies five privacy 
requirements for enabling privacy protection for the consumer of a web service 
across multiple domains and services: 

— the WSA must enable privacy policy statements to be expressed about web 
services; 

— advertised web service privacy policies must be expressed in P3P [15]; 

— the WSA must enable a consumer to access a web service’s advertised privacy 
policy statement; 

— the WSA must enable delegation and propagation of privacy policy; 

— web services must not be precluded from supporting interactions where one 
or more parties of the interaction are anonymous. 

Most of these requirements have been recently studied and investigated in 
the W3C P3P Beyond HTTP task force [15]. Further, this task force is work- 
ing on the identification of the requirements for adopting P3P into a number of 
protocols and applications other than HTTP, such as XML applications, SOAP, 
and web services. As a first step to privacy protection, the W3C P3P Beyond 
HTTP task force recommends that discovery agencies have their own privacy 
policies that govern the use of data collected both from service providers and 
service requestors. In this respect, the main requirement stated in [15] is that 
collected personal information must not be used or disclosed for purposes other 
than performing the operations for which it was collected, except with the con- 
sent of the subject or as required by law. Additionally, such information must 
be retained only as long as necessary for performing the required operations. 

5 Towards a Secure Semantic Web 

For the semantic web to be secure all of its components have to be secure. These 
components include web databases and services, XML and RDF documents, and 
information integration services. As more progress is made in investigating the 
various security issues for these components, we can envisage developing a 
secure semantic web. Note that logic, proof and trust are at the highest layers of 
the semantic web. Security cuts across all layers, and this is a challenge. That is, 
we need security for each of the layers and we must also ensure secure 
interoperability. For example, consider the lowest layer. One needs secure TCP/IP, secure 
sockets, and secure HTTP. There are now security protocols for these various 
lower-layer protocols. One also needs end-to-end security: one cannot just 
have secure TCP/IP built on untrusted communication layers. In other words, we need 
network security. The next layer is XML. One needs secure XML. That is, access 
must be controlled to various portions of the document for reading, browsing 
and modifications. There is research on securing XML. The next step is securing 
RDF. Now with RDF not only do we need secure XML, we also need security 
for the interpretations and semantics. For example, in certain contexts, 
portions of the document may be Unclassified, while in other contexts 
the document may be Classified. As an example, one could declassify an RDF 
document once the war is over. Once XML and RDF have been secured, the next 
step is to examine security for ontologies and interoperation. That is, ontologies 
may have security levels attached to them. The challenge is how to use 
these ontologies for secure information integration. Researchers have done some 
work on the secure interoperability of databases. We need to revisit this research 
and then determine what else needs to be done so that the information on the 
web can be managed, integrated and exchanged securely. Closely related to se- 
curity is privacy. That is, certain portions of the document may be private while 
certain other portions may be public or semi-private. Privacy has received a lot 
of attention recently, partly due to national security concerns. Privacy for the 
semantic web may be a critical issue. That is, how does one take advantage of 
the semantic web and still maintain privacy and sometimes anonymity? We also 
need to examine the inference problem for the semantic web. Inference is the 
process of posing queries and deducing new information. It becomes a problem 
when the deduced information is something the user is unauthorized to know. 
With the semantic web, and especially with data mining tools, one can make all 
kinds of inferences. That is, the semantic web exacerbates the inference problem. 
Security should not be an afterthought. We have often heard that one needs 
to insert security into the system right from the beginning. Similarly, security 
cannot be an afterthought for the semantic web. However, we also cannot make 
the system inefficient by insisting on one hundred percent security at all 
times. What is needed is a flexible security policy. In some situations we 
may need one hundred percent security, while in other situations, say, 
thirty percent security (whatever that means) may be sufficient. 

6 Conclusions 

In this paper we have focused on security and privacy issues for the semantic web. 
In particular, we have discussed these issues for two of the key components of 
the semantic web, that is, web databases and services. Besides providing background 
information on web databases and services, we have discussed the main issues 
related to security and privacy: what the main challenges are, and which 
solutions are the most promising. Finally, we have discussed some of the issues in 
developing a secure semantic web. 

Abstract. Peer-to-peer (P2P) systems are gaining increasing popularity 
as a scalable means to share data among a large number of autonomous 
nodes. In this paper, we consider the case in which the nodes in a P2P 
system store XML documents. We propose a fully decentralized approach 
to the problem of routing path queries among the nodes of a P2P 
system based on maintaining specialized data structures, called filters, that 
efficiently summarize the content, i.e., the documents, of one or more 
nodes. Our proposed filters, called multi-level Bloom filters, are based 
on extending Bloom filters so that they maintain information about the 
structure of the documents. In addition, we advocate building a hierar- 
chical organization of nodes by clustering together nodes with similar 
content. Similarity between nodes is related to the similarity between 
the corresponding filters. We also present an efficient method for update 
propagation. Our experimental results show that multi-level Bloom filters 
outperform the classical Bloom filters in routing path queries. Further- 
more, the content-based hierarchical grouping of nodes increases recall, 
that is, the number of documents that are retrieved. 



1 Introduction 

The popularity of file sharing systems such as Napster, Gnutella and Kazaa has 
spurred much current attention to peer-to-peer (P2P) computing. Peer-to-peer 
computing refers to a form of distributed computing that involves a large number 
of autonomous computing nodes (the peers) that cooperate to share resources 
and services [1]. As opposed to traditional client-server computing, nodes in a 
P2P system have equal roles and act as both data providers and data consumers. 
Furthermore, such systems are highly dynamic in that nodes join or leave the 
system and change their content constantly. 

Motivated by the fact that XML has evolved as a standard for publishing and 
exchanging data in the Internet, we assume that the nodes in a P2P system store 
and share XML documents [23]. Such XML documents may correspond either 
to native XML documents or to XML-based descriptions of local services or 
datasets. Such datasets may be stored in databases local to each node, supporting 
diverse data models, and exported by the node as XML data. 

A central issue in P2P computing is locating the appropriate data in these 
huge, massively distributed and highly dynamic data collections. Traditionally, 
search is based on keyword queries, that is, queries for documents whose name 
matches a given keyword or for documents that include a specific keyword. In 
this paper, we extend search to support path queries that exploit the structure of 
XML documents. Although data may exhibit some structure, in a P2P context, 
it is too varied, irregular or mutable to easily map to a fixed schema. Thus, our 
assumption is that XML documents are schema-less. 

We propose a decentralized approach to routing path queries among highly 
distributed XML documents based on maintaining specialized data structures 
that summarize large collections of documents. We call such data structures 
filters. In particular, each node maintains two types of filters, a local filter sum- 
marizing the documents stored locally at the node and one or more merged filters 
summarizing the documents of its neighboring nodes. Each node uses its filters 
to route a query only to those nodes that may contain relevant documents. Fil- 
ters should be small, scalable to a large number of nodes and documents and 
support frequent updates. 

Bloom filters have been used as summaries in such a context [2]. Bloom 
filters are compact data structures used to support keyword queries. However, 
Bloom filters are not appropriate for summarizing hierarchical data, since they 
do not exploit the structure of data. To this end, we introduce two novel multi- 
level data structures, Breadth and Depth Bloom filters, that support efficient 
processing of path queries. Our experimental results show that both multi-level 
Bloom filters outperform a traditional Bloom filter of the same size in evaluating path 
queries. We show how multi-level Bloom filters can be used as summaries to 
support efficient query routing in a P2P system where the nodes are organized 
to form hierarchies. Furthermore, we propose an efficient mechanism for the 
propagation of filter updates. Our experimental results show that the proposed 
mechanism scales well to a large number of nodes. 

In addition, we propose creating overlay networks of nodes by linking to- 
gether nodes with similar content. The similarity of the content (i.e., the local 
documents) of two nodes is related to the similarity of their filters. This is cost 
effective, since a filter for a set of documents is much smaller than the documents 
themselves. Furthermore, the filter comparison operation is more efficient than a 
direct comparison between sets of documents. As our experimental results show, 
the content-based organization is very efficient in retrieving a large number of 
relevant documents, since it benefits from the content clusters that are created 
when forming the network. 

In summary, the contribution of this paper is twofold: (i) it proposes using 
filters for routing path queries over distributed collections of schema-less XML 
documents and (ii) it introduces overlay networks over XML documents that 
cluster nodes with similar documents, where similarity between documents is 
related to the similarity between their filters. 

The remainder of this paper is structured as follows. Section 2 introduces 
multi-level Bloom filters as XML routers in P2P systems. Section 3 describes a 
hierarchical distribution of filters and the mechanism for building content-based 
overlay networks based on filter similarity. Section 4 presents the algorithms for 
query routing and update propagation, while Section 5 presents our experimental results. 
Section 6 presents related research and Section 7 concludes the paper. 

2 Routers for XML Documents 

We consider a P2P system in which each participating node stores XML docu- 
ments. Users specify queries using path expressions. Such queries may originate 
at any node. Since it is not reasonable to expect that users know which node 
hosts the requested documents, we propose using appropriately distributed data 
structures, called filters, to route the query to the appropriate nodes. 

2.1 System Model 

We consider a P2P system where each node $n_i$ maintains a set of XML documents 
$D_i$ (a particular document may be stored in more than one node). Each node is 
logically linked to a relatively small set of other nodes called its neighbors. 



device 
    printer 
        color 
        postscript 
    camera 
        digital 

(b) 

Fig. 1. Example of (a) an XML document and (b) the corresponding tree 

In our data model, an XML document is represented by an unordered labeled 
tree, where tree nodes correspond to document elements, while edges represent 
direct element-subelement relationships. Figure 1 depicts an XML service de- 
scription for a printer and a camera provided by a node and the corresponding 
XML tree. Although most P2P systems support only queries for documents that 
contain one or more keywords, we also want to query the structure of documents. 
Thus, we consider path queries that are simple path expressions in an XPath-like 
query language. 

Definition 1. (path query) A path query of length $p$ has the form 
"$s_1 l_1 s_2 l_2 \ldots s_p l_p$", where each $l_i$ is an element name and each $s_i$ is either / or //, denoting 
respectively parent-child and ancestor-descendant traversal. 

A keyword query for documents containing keyword k is just the path query 
//k. For a query q and a document d, we say that q is satisfied by d, or match(q, 
d) is true, if the path expression forming the query exists in the document. 
Otherwise we have a miss. Nodes that include documents that match the query 
are called matching nodes. 
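
As an illustration of Definition 1 and of the match predicate, the following minimal sketch evaluates a path query against a document tree. It assumes that a tree node is a pair (label, list of children) and that a query starting with / is anchored at the document root; the names parse_query, descendants and match are ours, and the tree encodes the device example of Fig. 1.

```python
import re


def parse_query(query):
    """Split '/a//b' into [('/', 'a'), ('//', 'b')]."""
    steps = re.findall(r"(//|/)([^/]+)", query)
    if not steps or "".join(axis + label for axis, label in steps) != query:
        raise ValueError(f"not a path query: {query!r}")
    return steps


def descendants(node):
    """All proper descendants of a (label, children) node."""
    for child in node[1]:
        yield child
        yield from descendants(child)


def match(query, tree):
    """True if the path expression exists in the document tree."""
    context = [("", [tree])]  # virtual node placed above the document root
    for axis, label in parse_query(query):
        reachable = []
        for node in context:
            reachable.extend(node[1] if axis == "/" else descendants(node))
        context = [n for n in reachable if n[0] == label]
        if not context:
            return False  # a miss
    return True


FIG1_TREE = (
    "device",
    [
        ("printer", [("color", []), ("postscript", [])]),
        ("camera", [("digital", [])]),
    ],
)

for q in ["/device/printer/color", "//camera/digital", "/device//digital", "/device/digital"]:
    print(q, match(q, FIG1_TREE))
```

The keyword query //k is handled by the same loop, since the descendants of the virtual node include every element of the document.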




<xml> 
  <device> 
    <printer> 
      <color></color> 
      <postscript></postscript> 
    </printer> 
    <camera> 
      <digital></digital> 
    </camera> 
  </device> 
</xml> 

(a) 

2.2 Query Routing 

A given query may be matched by documents at various nodes. Thus, central 
to a P2P system is a mechanism for locating nodes with matching documents. 
In this regard, there are two types of P2P systems. In structured P2P systems, 
documents (or indexes of documents) are placed at specific nodes usually based 
on distributed hashing (such as in CAN [21] and Chord [20]). With distributed 
hashing, each document is associated with a key and each node is assigned a 
range of keys and thus documents. Although structured P2P systems provide 
very efficient searching, they compromise node autonomy and in addition require 
sophisticated load balancing procedures. 

In unstructured P2P systems, resources are located at random points. Un- 
structured P2P systems can be further distinguished between systems that use 
indexes and those that are based on flooding and its variations. With flooding 
(such as in Gnutella [22]), a node searching for a document contacts its neigh- 
bor nodes which in turn contact their own neighbors until a matching node is 
reached. Flooding incurs large network overheads. In the case of indexes, these 
can be either centralized (as in Napster [8]), or distributed among the nodes (as 
in routing indexes [19]) providing for each node a partial view of the system. 

Our approach is based on unstructured P2P systems with distributed indexes. 
We propose maintaining as indexes specialized data structures, called filters, to 
facilitate propagating the query only to those nodes that may contain relevant 
information. In particular, each node maintains one filter that summarizes all 
documents that exist locally in the node. This is called a local filter. Besides 
its local filter, each node also maintains one or more filters, called merged filters, 
summarizing the documents of a set of its neighbors. When a query reaches a 
node, the node first checks its local filter and uses the merged filters to direct 
the query only to those nodes whose filters match the query. 

Filters should be much smaller than the data itself and should be lossless, 
that is, if the data match the query, then the filter should match the query as 
well. In particular, each filter should support an efficient filter-match operation 
such that if a document matches a query q then filter-match should also be true. 
If the filter-match returns false, we say that we have a miss. 

Definition 2. (filter match) A filter F(D) for a set of documents D has the 
following property: For any query q, if filter-match(q, F(D)) = false, then match(q, 
d) = false, $\forall d \in D$. 

Note that the reverse does not necessarily hold. That is, if filter-match(q, 
F(D)) = true, then there may or may not exist documents $d \in D$ such that 
match(q, d) is true. We call false positive the case in which, for a filter F(D) for 
a set of documents D, filter-match(q, F(D)) = true but there is no document 
$d \in D$ that satisfies q, that is, $\forall d \in D$, match(q, d) = false. We are interested in 
filters with small probability of false positives. 

Bloom filters are appropriate as summarizing filters in this context in terms 
of scalability, extensibility and distribution. However, they do not support path 
queries. To this end, we propose an extension called multi-level Bloom filters. 
Multi-level Bloom filters were first presented in [17] where preliminary results 
were reported for their centralized use. To distinguish traditional Bloom filters 
from the extended ones, we shall call the former simple Bloom filters. Other 
hash-based structures, such as signatures [13], have properties similar to those of Bloom 
filters, and our approach could also be applied to extend them in a similar fashion. 

2.3 Multi-level Bloom Filters 

Bloom filters are compact data structures for probabilistic representation of a 
set that support membership queries ("Is element a in set A?"). Since their 
introduction [3], Bloom filters have seen many uses such as web caching [4] and 
query filtering and routing [2,5]. Consider a set $A = \{a_1, a_2, \ldots, a_n\}$ of $n$ elements. 
The idea is to allocate a vector $v$ of $m$ bits, initially all set to 0, and then choose 
$k$ independent hash functions, $h_1, h_2, \ldots, h_k$, each with range 1 to $m$. For each 
element $a \in A$, the bits at positions $h_1(a), h_2(a), \ldots, h_k(a)$ in $v$ are set to 1 
(Fig. 2). A particular bit may be set to 1 many times. Given a query for $b$, the 
bits at positions $h_1(b), h_2(b), \ldots, h_k(b)$ are checked. If any of them is 0, then 
certainly $b \notin A$. Otherwise, we conjecture that $b$ is in the set although there is a 
certain probability that we are wrong. This is a false positive. It has been shown 
[3] that the probability of a false positive is equal to $(1 - e^{-kn/m})^k$. To support 
updates of the set $A$ we maintain for each location $i$ in the bit vector a counter 
$c(i)$ of the number of times that the bit is set to 1 (the number of elements that 
hashed to $i$ under any of the hash functions). 
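
A simple Bloom filter with per-position counters can be sketched as follows. The sketch assumes that the k hash functions are obtained by salting SHA-256 with the function index, which is one possible choice and not prescribed by the text, and it numbers positions from 0 to m - 1 instead of 1 to m. The class and function names are ours.

```python
import hashlib
import math


class CountingBloomFilter:
    """Bit vector of m positions, kept as counters so that elements can be removed."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.counts = [0] * m

    def _positions(self, item):
        # Assumed hash family: SHA-256 of "<index>:<item>", reduced modulo m.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.counts[p] += 1

    def remove(self, item):
        if item not in self:
            raise ValueError(f"{item!r} is not in the filter")
        for p in self._positions(item):
            self.counts[p] -= 1

    def __contains__(self, item):
        # False means "certainly absent"; True may be a false positive.
        return all(self.counts[p] > 0 for p in self._positions(item))

    def bits(self):
        """The plain bit vector v derived from the counters."""
        return [1 if c else 0 for c in self.counts]


def false_positive_rate(m, k, n):
    """The estimate (1 - e^(-kn/m))^k quoted above."""
    return (1 - math.exp(-k * n / m)) ** k


f = CountingBloomFilter(m=64, k=4)
for element in ["device", "printer", "camera"]:
    f.add(element)
print("printer" in f, f.bits().count(1), false_positive_rate(64, 4, 3))
```

Keeping counters costs more space than a plain bit vector, but a deletion then only clears a bit when no remaining element maps to it.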






[Figure: bit vector v, m = 10 bits] 

Fig. 2. A (simple) Bloom filter with k = 4 hash functions 



Let T be an XML tree with j levels and let the level of the root be level 1. 
The Breadth Bloom Filter (BBF) for an XML tree T with j levels is a set of 
simple Bloom filters $\{BBF_1, BBF_2, \ldots, BBF_i\}$, $i \le j$. There is one simple Bloom 
filter, denoted $BBF_i$, for each level i of the tree. In each $BBF_i$, we insert the 
elements of all nodes at level i. To improve performance and decrease the false 
positive probability in the case of $i < j$, we may construct an additional Bloom 
filter denoted $BBF_0$, where we insert all elements that appear in any node of 
the tree. For example, the BBF for the XML tree in Fig. 1 is a set of 4 simple 
Bloom filters (Fig. 3(a)). 

The Depth Bloom Filter (DBF) for an XML tree T with j levels is a set of 
simple Bloom filters $\{DBF_0, DBF_1, DBF_2, \ldots, DBF_{i-1}\}$, $i \le j$. There is one 
Bloom filter, denoted $DBF_i$, for each path of the tree with length i (i.e., a path 
of i + 1 nodes), where we insert all paths of length i. For example, the DBF for 
the XML tree in Fig. 1 is a set of 3 simple Bloom filters (Fig. 3(b)). Note that 
we insert paths as a whole; we do not hash each element of the path separately. 
We use a different notation for paths starting from the root. This is not shown 
in Fig. 3(b) for ease of presentation. 

(a) 
BBF_0: (device ∪ printer ∪ camera ∪ color ∪ postscript ∪ digital) 
BBF_1: device 
BBF_2: (printer ∪ camera) 
BBF_3: (color ∪ postscript ∪ digital) 

(b) 
DBF_0: (device ∪ printer ∪ camera ∪ color ∪ postscript ∪ digital) 
DBF_1: (device/printer ∪ device/camera ∪ camera/digital ∪ printer/color ∪ printer/postscript) 
DBF_2: (device/camera/digital ∪ device/printer/color ∪ device/printer/postscript) 

Fig. 3. The multi-level Bloom filters for the XML tree of Fig. 1: (a) the Breadth Bloom 
filter and (b) the Depth Bloom filter 
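
The construction of both multi-level filters can be sketched as follows. The sketch assumes that a tree node is a pair (label, list of children), that each simple Bloom filter is an integer bit mask of M bits filled by K salted SHA-256 hashes, and that a path is inserted as its labels joined by /. The values of M and K and all names are choices of this example, and the separate notation for root paths is not modeled.

```python
import hashlib

M, K = 64, 3  # assumed filter size in bits and number of hash functions


def positions(item):
    """K bit positions in [0, M) for a string, derived from salted SHA-256."""
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{item}".encode()).digest()[:8], "big") % M
        for i in range(K)
    ]


def add(bits, item):
    for p in positions(item):
        bits |= 1 << p
    return bits


def contains(bits, item):
    return all((bits >> p) & 1 for p in positions(item))


def build_bbf(tree):
    """Return [BBF_0, BBF_1, ..., BBF_j]; BBF_0 holds every element name."""
    by_level = {}

    def visit(node, level):
        by_level.setdefault(level, []).append(node[0])
        for child in node[1]:
            visit(child, level + 1)

    visit(tree, 1)
    bbf = [0] * (max(by_level) + 1)
    for level, labels in by_level.items():
        for label in labels:
            bbf[level] = add(bbf[level], label)
            bbf[0] = add(bbf[0], label)
    return bbf


def build_dbf(tree):
    """Return [DBF_0, ..., DBF_{j-1}]; DBF_i holds every path of length i."""
    paths = []

    def visit(node, ancestors):
        chain = ancestors + [node[0]]
        # Every suffix of the root-to-node chain is a downward path ending here.
        paths.extend(chain[start:] for start in range(len(chain)))
        for child in node[1]:
            visit(child, chain)

    visit(tree, [])
    dbf = [0] * max(len(p) for p in paths)
    for p in paths:
        dbf[len(p) - 1] = add(dbf[len(p) - 1], "/".join(p))  # whole path, hashed once
    return dbf


FIG1_TREE = (
    "device",
    [
        ("printer", [("color", []), ("postscript", [])]),
        ("camera", [("digital", [])]),
    ],
)
bbf, dbf = build_bbf(FIG1_TREE), build_dbf(FIG1_TREE)
print(len(bbf), len(dbf), contains(dbf[1], "device/printer"))
```

For the tree of Fig. 1 this yields four Breadth filters and three Depth filters, in line with the counts given in the text.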

The BBF filter-match operation (that checks whether a BBF matches a 
query) distinguishes between queries starting from the root and partial path 
queries. In both cases, if $BBF_0$ exists, the procedure checks whether it matches 
all elements of the query. If so, it proceeds to examine the structure of the path, 
else, it returns a miss. For a root query $/a_1/a_2/\ldots/a_p$, every level i from 1 to 
p of the filter is checked for the corresponding $a_i$. The procedure succeeds if 
there is a match for all elements. For a partial path query, for every level i of 
the filter, the first element of the path is checked. If there is a match, the next 
level is checked for the next element and so on until either the whole path is 
matched or there is a miss. If there is a miss, the procedure repeats for level i + 
1. For paths with the ancestor-descendant axis //, the path is split at the // and 
the sub-paths are processed. The complexity of the BBF filter-match is $O(p^2)$, 
where p is the length (number of elements) of the query; in particular, for root 
queries the complexity is $O(p)$. The DBF filter-match operation checks whether 
all sub-paths of the query match the corresponding filters; its complexity is also 
$O(p^2)$. A detailed description of the filter match operations is given in [24]. 
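
The BBF filter-match can be sketched along the lines described above. This is our reading of the description and not the procedure of [24]: sub-paths separated by // are placed greedily at the earliest level where they fit, and exact Python sets stand in for the per-level Bloom filters so that the control flow is easy to follow (a real Bloom filter would answer the same membership tests, with possible false positives). All names are hypothetical.

```python
import re


def split_segments(query):
    """Return (anchored_at_root, list of sub-paths split at '//')."""
    steps = re.findall(r"(//|/)([^/]+)", query)
    if not steps or "".join(axis + label for axis, label in steps) != query:
        raise ValueError(f"not a path query: {query!r}")
    segments = []
    for axis, label in steps:
        if axis == "//" or not segments:
            segments.append([])
        segments[-1].append(label)
    return steps[0][0] == "/", segments


def bbf_match(query, bbf):
    """bbf[0] summarizes all elements; bbf[i] summarizes level i (root is level 1)."""
    anchored, segments = split_segments(query)
    if any(label not in bbf[0] for seg in segments for label in seg):
        return False  # BBF_0 already reports a miss
    depth = len(bbf) - 1

    def fits(seg, start):
        # The sub-path occupies consecutive levels start, start + 1, ...
        return start + len(seg) - 1 <= depth and all(
            label in bbf[start + offset] for offset, label in enumerate(seg)
        )

    next_level = 1
    for index, seg in enumerate(segments):
        starts = [1] if index == 0 and anchored else range(next_level, depth + 1)
        hit = next((s for s in starts if fits(seg, s)), None)
        if hit is None:
            return False
        next_level = hit + len(seg)  # a descendant lies strictly below
    return True


# Level summaries of the tree in Fig. 1, with sets as stand-ins for Bloom filters.
BBF = [
    {"device", "printer", "camera", "color", "postscript", "digital"},
    {"device"},
    {"printer", "camera"},
    {"color", "postscript", "digital"},
]
for q in ["/device/printer/color", "//camera/digital", "/printer", "/device/camera/color"]:
    print(q, bbf_match(q, BBF))
```

The last query in the usage lines is accepted although the document contains no such path: a Breadth filter records which elements occur at each level but not which parent they belong to, so this kind of structural false positive occurs even with exact per-level sets.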

3 Content-Based Linking 

In this section, we describe how the nodes are organized and how the filters are 
built and distributed among them. 

3.1 Hierarchical Organization 

Nodes in a P2P system may be organized to form various topologies. In a hi- 
erarchical organization (Fig. 4), a set of nodes designated as root nodes are 
connected to a main channel that provides communication among them. The 
main channel acts as a broadcast mechanism and can be implemented in many 
different ways. A hierarchical organization is best suited when the participat- 
ing nodes have different processing and storage capabilities as well as varying 
stability, that is, some nodes stay longer online, while others stay online for a 
limited time. With this organization, nodes belonging to the top levels receive 
more load and responsibilities; thus, the most stable and powerful nodes should 
be located at the top levels of the hierarchies. 




Fig. 4. Hierarchical organization 



Each node maintains two filters: one summarizing its local documents, called 
the local filter, and, if it is a non-leaf node, one summarizing the documents of all 
nodes in its sub-tree, called the merged filter. In addition, root nodes keep one merged 
filter for each of the other root nodes. The construction of filters follows a 
bottom-up procedure. A leaf node sends its local filter to its parent. A non-leaf node, after 
receiving the filters of all its children, merges them and produces its merged filter. 
Then, it merges the merged filter with its own local filter and sends the resulting 
filter to its parent. When a root computes its merged filter, it propagates it to 
all other root nodes. 

Merging of two or more multi-level filters corresponds to computing a bitwise 
OR (BOR) of each of their levels. That is, the merged filter, D, of two Breadth 
Bloom filters B and C with i levels is a Breadth Bloom filter with i levels: 
$D = \{D_0, D_1, \ldots, D_i\}$, where $D_j = B_j \;\mathrm{BOR}\; C_j$, $0 \le j \le i$. Similarly, we define 
merging for Depth Bloom filters. 
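
Merging and the bottom-up construction of merged filters can be sketched as follows. The sketch assumes that a multi-level filter is a list of integer bit masks, one per level, and that a filter with fewer levels is padded with empty levels before the OR; the padding rule and the Peer class are assumptions of this example.

```python
from functools import reduce
from itertools import zip_longest
from operator import or_


def merge(*filters):
    """Level-wise bitwise OR of multi-level filters given as lists of bit masks."""
    return [reduce(or_, level) for level in zip_longest(*filters, fillvalue=0)]


class Peer:
    def __init__(self, local, children=()):
        self.local = local            # filter over the node's own documents
        self.children = list(children)
        self.merged = None            # filter over the sub-tree, non-leaf nodes only

    def build(self):
        """Compute merged filters bottom-up; return the filter sent to the parent."""
        if not self.children:
            return self.local
        self.merged = merge(*(child.build() for child in self.children))
        return merge(self.merged, self.local)


leaf1 = Peer([0b0001, 0b0010])
leaf2 = Peer([0b0100, 0b0000])
root = Peer([0b1000, 0b0001], [leaf1, leaf2])
sent_up = root.build()
print([bin(x) for x in root.merged], [bin(x) for x in sent_up])
```

Because OR is associative and commutative, the order in which the children's filters arrive does not affect the merged filter.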

Although we describe a hierarchical organization, our mechanism can be 
easily applied to other node organizations as well. Preliminary results of the 
filters' deployment in a non-hierarchical peer-to-peer system are reported in [18]. 



3.2 Content-Based Clustering 

Nodes may be organized in hierarchies based on their proximity at the 
underlying physical network to exploit physical locality and minimize query response 
time. The formation of hierarchies can also take into account other 
parameters such as administrative domains, stability and the different processing and 
storage capabilities of the nodes. Thus, hierarchies can be formed that better 
leverage the workload. However, such organizations ignore the content of nodes. 
We propose an organization of nodes based on the similarity of their content so 
that nodes with similar content are grouped together. The goal of such content- 
based clustering is to improve the efficiency of query routing by reducing the 
number of irrelevant nodes that process a query. In particular, we would like to 
optimize recall, that is, the percentage of matching nodes that are visited during 
query routing. We expect that content-based clustering will increase recall since 
matching nodes will be only a few hops apart. 

Instead of checking the similarity of the documents themselves, we rely on 
the similarity of their filters. This is more cost effective, since a filter for a set of 
documents is much smaller than the documents. Moreover, the filter comparison 
operation is more efficient than a comparison between two sets of documents. 
Documents with similar filters are expected to match similar queries. 

Let B be a simple Bloom filter of size m. We shall use the notation $B[i]$, 
$1 \le i \le m$, to denote the $i$-th bit of the filter. Given two simple Bloom filters B and C of 
size m, their Manhattan (or Hamming) distance $d(B, C)$ is defined as 
$d(B, C) = |B[1] - C[1]| + |B[2] - C[2]| + \ldots + |B[m] - C[m]|$, that is, the number of bits 
in which they differ. We define the similarity of B and C as 
$similarity(B, C) = m - d(B, C)$. The larger their similarity, the more similar the filters. In the case of 
multi-level Bloom filters, we take the sum of the similarities of each pair of the 
corresponding levels. 
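
The similarity measure can be sketched as follows, assuming that each simple Bloom filter is stored as an integer bit mask of m bits and that two multi-level filters have the same number of levels. The function names are ours.

```python
def hamming(b, c):
    """Number of bit positions in which two bit masks differ."""
    return bin(b ^ c).count("1")


def similarity(b, c, m):
    """similarity(B, C) = m - d(B, C) for two filters of m bits."""
    return m - hamming(b, c)


def multilevel_similarity(filters_b, filters_c, m):
    """Sum of the similarities of corresponding levels."""
    if len(filters_b) != len(filters_c):
        raise ValueError("filters must have the same number of levels")
    return sum(similarity(b, c, m) for b, c in zip(filters_b, filters_c))


print(similarity(0b1010, 0b0110, 4))
print(multilevel_similarity([0b1010, 0b1111], [0b0110, 0b1111], 4))
```

The XOR of two bit masks has a 1 exactly where they differ, so counting its set bits gives the Manhattan distance of the definition.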

We use the following procedure to organize nodes based on content similarity. 
When a new node n wishes to join the P2P system, it sends a join request that 
contains its local filter to all root nodes. Upon receiving a join request, each 
root node compares the received local filter with its merged filter and responds 
to n with the measure of their filter similarity. The root node with the largest 
similarity is called the winner root. Node n compares its similarity with the 
winner root to a system-defined threshold. If the similarity is larger than the 
threshold, n joins the hierarchy of the winner root, else n becomes a root node 
itself. In the former case, node n replies to the winner root, which propagates
the reply to all nodes in its sub-tree. Node n then connects to the node in the
winner root’s subtree that has the most similar local filter.
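
The join procedure can be sketched as follows. The sketch makes several assumptions that are ours: filters are lists of 0/1 bits, the merged filter of a node is the bitwise OR of its local filter and its children's merged filters, and the message exchange is replaced by direct function calls. The class and function names are hypothetical.

```python
class Node:
    def __init__(self, name, local_filter):
        self.name = name
        self.local_filter = list(local_filter)
        self.children = []
        self.parent = None

    def merged_filter(self):
        # Assumption: merged filter = OR of the local filter and the
        # merged filters of all children.
        merged = list(self.local_filter)
        for child in self.children:
            merged = [a | b for a, b in zip(merged, child.merged_filter())]
        return merged

    def subtree(self):
        yield self
        for child in self.children:
            yield from child.subtree()


def similarity(b, c):
    # m - d(B, C): the number of positions in which the bits agree.
    return sum(1 for x, y in zip(b, c) if x == y)


def join(new_node, roots, threshold):
    """Attach new_node to the most similar hierarchy or make it a root.

    Returns the (possibly extended) list of root nodes.
    """
    if roots:
        winner = max(
            roots,
            key=lambda r: similarity(new_node.local_filter, r.merged_filter()),
        )
        best = similarity(new_node.local_filter, winner.merged_filter())
        if best > threshold:
            # Connect to the node of the winner's sub-tree whose local
            # filter is most similar to the new node's local filter.
            target = max(
                winner.subtree(),
                key=lambda n: similarity(new_node.local_filter, n.local_filter),
            )
            target.children.append(new_node)
            new_node.parent = target
            return roots
    return roots + [new_node]
```

A node whose similarity to every root is at most the threshold starts a new hierarchy, so lower thresholds yield fewer, larger hierarchies.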

The procedure for creating content-based hierarchies effectively clusters 
nodes based on their content, so that similar nodes belong to the same hier- 
archy (cluster). The value of threshold determines the number of hierarchies in 
the system and affects system performance. Statistical knowledge, such as the 
average similarity among nodes, may be used to define threshold. We leave the 
definition of threshold and the dynamic adaptation of its value as future work. 

4 Querying and Updating 

We describe next how a query is routed and how updates are processed. 
4.1 Query Routing

Filters are used to facilitate query routing. In particular, when a query is issued 
at a node n, routing proceeds as follows. The local filter of node n is checked, 
and if there is a match, the local documents are searched. Next, the merged filter 
of n is checked, and if there is a match, the query is propagated to n’s children. 
The query is also propagated to the parent of the node. The propagation of a 
query towards the bottom of the hierarchy continues, until either a leaf node is 
reached, or the filter match with the merged filter of an internal node indicates 
a miss. The propagation towards the top of the hierarchy continues until the 
root node is reached. When a query reaches a root node, the root, apart from
checking the filter of its own sub-tree, also checks the merged filters of the
other root nodes and forwards the query only to those root nodes for which
there is a match. When a root node receives a query from another root, it only
propagates the query to its own sub-tree.
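
The routing steps above can be sketched as follows. To keep the example short, filters are replaced by exact Python sets of element names (so there are no false positives), queries have length 1, and messages are modeled as function calls; these simplifications and all names are assumptions of the sketch.

```python
class Node:
    def __init__(self, name, local):
        self.name = name
        self.local = set(local)      # stand-in for the local filter
        self.children = []
        self.parent = None

    def add_child(self, child):
        child.parent = self
        self.children.append(child)

    def merged(self):
        """Stand-in for the merged filter: everything in this sub-tree."""
        out = set(self.local)
        for child in self.children:
            out |= child.merged()
        return out


def route_down(node, query, hits, came_from=None):
    """Check the local filter, then descend only if the merged filter matches."""
    if query in node.local:
        hits.append(node.name)
    if query in node.merged():
        for child in node.children:
            if child is not came_from:
                route_down(child, query, hits)


def route(origin, query, roots):
    """Route a query issued at `origin`; returns the names of matching nodes."""
    hits = []
    route_down(origin, query, hits)
    node = origin
    while node.parent is not None:           # propagate towards the root
        route_down(node.parent, query, hits, came_from=node)
        node = node.parent
    for other in roots:                      # node is now the root
        if other is not node and query in other.merged():
            route_down(other, query, hits)   # other roots only search downwards
    return hits
```

With real Bloom filters, the membership tests would be filter matches and could occasionally send the query into a sub-tree that holds no matching document.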

4.2 Update Propagation 

When a document is updated or a document is inserted or deleted at a node, 
its local filter must be updated. An update can be viewed as a delete followed 
by an insert. When an update occurs at a node, apart from the update of its 
local filter, all merged filters that use this local filter must be updated. We 
present two different approaches for the propagation of updates based on the 
way the counters of the merged filters are computed. Note that in both cases 
we propagate the levels of the multi-level filter that have changed and not the 
whole multi-level filter. 

The straightforward way to use the counters at the merged filters is for every 
node to send to its parent, along with its filter, the associated counters. Then, 
the counters of the merged filter of each internal node are computed as the sum 
of the respective counters of its children’s filters. We call this method CountSum.
An example with simple Bloom Filters is shown in Fig. 5(a). Now, when a node
updates its local filter and its own merged filter to represent the update, it also 
sends the differences between its old and new counter values to its parent. After 
updating its own summary, the parent propagates in turn the difference to its 
parent until all affected nodes are informed. In the worst case, in which an update 
occurs at a leaf node, the number of messages that need to be sent is equal to the 
number of levels in the hierarchy, plus the number of roots in the main channel. 

We can improve the complexity of update propagation by making the follow- 
ing observation: an update will only result in a change in the filter itself if the 
counter turns from 0 to 1 or vice versa. Taking this into consideration, each node 
just sends its merged filter to its parent (local filter for the leaf nodes) and not 
the counters. A node that has received all the filters from its children creates its 
merged filter as before but uses the following procedure to compute the counters: 
it increases each counter bit by one every time a filter of its children has a 1 
in the corresponding position. Thus, each bit of the counter of a merged filter 
represents the number of its children’s filters that have set this bit to 1 (and not 
how many times the original filters had set the bit to 1). We call this method
BitSum. An example with simple Bloom Filters is shown in Fig. 5(c). When an
update occurs, it is propagated only if it changes a bit from 1 to 0 or vice versa. 

An example is depicted in Fig. 5. Assume that node $n_4$ performs an update;
as a result, its new (local) filter becomes (1, 0, 0, 1) and the corresponding
counters (1, 0, 0, 2). With CountSum (Fig. 5(a)), $n_4$ will send the difference
(-1, 0, -1, -1) between its old and new counters to node $n_2$, whose (merged)
filter will now become (1, 0, 1, 1) and the counters (2, 0, 1, 4). Node $n_2$ must
also propagate the difference (-1, 0, -1, -1) to its parent $n_1$ (although no
change was reflected at its filter). The final state is shown in Fig. 5(b). With
BitSum (Fig. 5(c)), $n_4$ will send to $n_2$ only those bits that have changed from
1 to 0 and vice versa, that is (-, -, -1, -). The new filter of $n_2$ will be
(1, 0, 1, 1) and the counters (2, 0, 1, 2). Node $n_2$ does not need to send the
update to $n_1$. The final state is illustrated in Fig. 5(d). The BitSum approach
sends fewer and smaller messages.
Fig. 5. An example of an update using CountSum and BitSum
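
The example can be replayed with a small sketch. The old counter values used below are not stated in the text; they are derived here by subtracting the reported differences from the reported new values, and the function names are hypothetical.

```python
def to_filter(counters):
    """A filter bit is 1 exactly when its counter is positive."""
    return [1 if c > 0 else 0 for c in counters]


def count_sum_update(parent_counters, old_child_counters, new_child_counters):
    """CountSum: the child sends the per-position counter differences."""
    diff = [new - old for old, new in zip(old_child_counters, new_child_counters)]
    new_parent = [p + d for p, d in zip(parent_counters, diff)]
    return diff, new_parent


def bit_sum_update(parent_counters, old_child_filter, new_child_filter):
    """BitSum: the child sends only the bits that flipped (None = unchanged)."""
    message = [
        new - old if new != old else None
        for old, new in zip(old_child_filter, new_child_filter)
    ]
    new_parent = [p + (d or 0) for p, d in zip(parent_counters, message)]
    return message, new_parent


# CountSum at the updating node: counters (2, 0, 1, 3) become (1, 0, 0, 2).
diff, parent_cs = count_sum_update([3, 0, 2, 5], [2, 0, 1, 3], [1, 0, 0, 2])
# CountSum must forward any non-zero difference, even if no bit flips.
count_sum_forwards = any(d != 0 for d in diff)

# BitSum at the updating node: filter (1, 0, 1, 1) becomes (1, 0, 0, 1).
old_parent_bs = [2, 0, 2, 2]
message, parent_bs = bit_sum_update(old_parent_bs, [1, 0, 1, 1], [1, 0, 0, 1])
# BitSum forwards only if some bit of the parent's own filter flipped.
bit_sum_forwards = to_filter(old_parent_bs) != to_filter(parent_bs)
```

In this run CountSum has to forward the difference to the next level, whereas BitSum stops at the parent because the parent's filter is unchanged.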



5 Experimental Evaluation 

We implemented the BBF (Breadth Bloom filter) and the DBF (Depth Bloom 
Filter) data structures, as well as a Simple Bloom filter (SBF) (that just hashes 
all elements of a document) for comparison. For the hash functions, we used 
MD5 [6], a cryptographic message digest algorithm that hashes arbitrary-length
strings to 128 bits. The k hash functions are built by first calculating the MD5
signature of the input string, which yields 128 bits, and then taking k groups of
128/k bits from it. We used the Niagara generator [7] to generate tree-structured
XML documents of arbitrary complexity. Three types of experiments are per- 
formed. The goal of the first set of experiments is to demonstrate the appro- 
priateness of multi-level Bloom filters as filters of hierarchical documents. To 
this end, we evaluate the false positive probability for both DBF and BBF and 
compare it with the false positive probability for a same size SBF for a variety of 
query workloads and document structures. The second set of experiments focuses
on the performance of Bloom filters in a distributed setting using both a content-
based and a non content-based organization. In the third set of experiments, we
evaluate the update propagation procedures.
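
The construction of the k hash functions from a single MD5 digest can be sketched as follows. Reducing each group modulo the filter size m, and the names used, are assumptions of this sketch; the text only states that k groups of 128/k bits are taken from the digest.

```python
import hashlib
from typing import List


def md5_positions(item: str, k: int, m: int) -> List[int]:
    """Derive k filter positions from the 128-bit MD5 digest of `item`."""
    if not 1 <= k <= 128:
        raise ValueError("k must be between 1 and 128")
    digest = int.from_bytes(hashlib.md5(item.encode("utf-8")).digest(), "big")
    group_bits = 128 // k
    mask = (1 << group_bits) - 1
    # Assumption: each group of bits is mapped to a position with modulo m.
    return [((digest >> (i * group_bits)) & mask) % m for i in range(k)]


class SimpleBloomFilter:
    """A plain Bloom filter that hashes every inserted element."""

    def __init__(self, m: int, k: int) -> None:
        self.m, self.k = m, k
        self.bits = [0] * m

    def insert(self, item: str) -> None:
        for pos in md5_positions(item, self.k, self.m):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        return all(self.bits[pos] for pos in md5_positions(item, self.k, self.m))
```

An inserted element is always reported as present; an element that was never inserted may still be reported as present, which is the false positive case measured in the experiments.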

5.1 Simple versus Multi-level Bloom Filters 

In this set of experiments, we evaluate the performance of multi-level Bloom 
filters. As our performance metric, we use the percentage of false positives, since 
the number of nodes that will process an irrelevant query depends on it directly. 
In all cases, the filters compared have the same total size. Our input parameters 
are summarized in Table 1. In the case of the Breadth Bloom filter, we excluded 
the optional Bloom filter $BBF_0$. The number of levels of the Breadth Bloom
filters is equal to the number of levels of the XML trees, while for the Depth 
Bloom filters, we have at most three levels. There is no repetition of element 
names in a single document or among documents. Queries are generated by 
producing arbitrary path queries with 90% elements from the documents and 
10% random ones. All queries are partial paths and the probability of the // 
axis at each query is set to 0.05. 

Table 1. Input parameters

Parameter                        Default Value                   Range
# of XML documents               200                             -
Total size of filters            78000 bits                      30000-150000 bits
# of hash functions              4                               -
# of queries                     100                             -
# of elements per document       50                              10-150
# of levels per document         4/6                             2-6
Length of query                  3                               2-6
Distribution of query elements   90% in documents, 10% random    0%-10%


Influence of filter size. In this experiment, we vary the size of the filters
from 30000 bits to 150000 bits. The lower limit is chosen from the formula
$k = (m/n) \ln 2$, which gives the number of hash functions k that minimizes the
false positive probability for a given size m and n inserted elements for an SBF:
we solved the equation for m keeping the other parameters fixed. As our results
show (Fig. 6(left)), both BBFs and DBFs outperform SBFs. For SBFs, increasing
their size does not improve their performance, since they recognize as misses only
paths that contain elements that do not exist in the documents. BBFs perform
very well even for 30000 bits with an almost constant 6% of false positives, while
DBFs require more space since the number of elements inserted is much larger
than that of BBFs and SBFs. However, when the size increases sufficiently, the
DBFs outperform even the BBFs. Note that in DBFs the number of elements
inserted in each level $i$ of the filter is about $2d^{i} + \sum_{j=i+1}^{l} d^{j}$,
where $d$ is the degree of the XML nodes and $l$ the number of levels of the XML
tree, while the corresponding number for BBFs is $d^{i-1}$, which is much smaller.
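
The sizing rule $k = (m/n) \ln 2$ used for the lower limit can be solved for m, as in the sketch below. The false positive estimate $(1 - e^{-kn/m})^k$ is the standard approximation for a simple Bloom filter, and the value n = 1000 in the usage lines is purely hypothetical; it is not a parameter of the experiments.

```python
import math


def optimal_size(k: int, n: int) -> int:
    """Smallest m (in bits) for which k = (m / n) * ln 2 hash functions suffice."""
    return math.ceil(k * n / math.log(2))


def false_positive_rate(m: int, n: int, k: int) -> float:
    """Standard approximation of the false positive probability of an SBF."""
    return (1.0 - math.exp(-k * n / m)) ** k


# Hypothetical usage: 4 hash functions and 1000 inserted elements.
m = optimal_size(4, 1000)
rate = false_positive_rate(m, 1000, 4)
```

At the optimum, each bit is set with probability about one half, so the estimated rate is close to $(1/2)^k$.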

Using the results of this experiment, we choose as the default size of the filters 
for the rest of the experiments in this set, a size of 78000 bits, for which both 
our structures showed reasonable results. For 200 documents of 50 elements, this 
represents 2% of the space that the documents themselves require. This makes 
Bloom filters a very attractive summary to be used in a P2P computing context. 





[Figure 6: two plots comparing SBF, DBF, and BBF]

Fig. 6. Comparison of Bloom filters: (left) filter size and (right) number of elements
per document



Influence of the number of elements per document. In this experiment, 
we vary the number of elements per document from 10 to 150 (Fig. 6(right)).
Again, SBFs filter out only path expressions with elements that do not exist 
in the document. When the filter becomes denser as the elements inserted are 
increased to 150, SBFs fail to recognize even some of these expressions. BBFs 
show the best overall performance with an almost constant percentage of 1 to 
2% of false positives. DBFs require more space and their performance rapidly 
decreases as the number of inserted elements increases, and for 150 elements, 
they become worse than the SBFs, because the filters become overloaded (most 
bits are set to 1). 



Other Experiments. We performed a variety of experiments [24]. Our experiments
show that DBFs perform well, although we have limited the number of
their levels to 3 (we do not insert sub-paths of length greater than 3). This is 
because for each path expression of length p, the filter-match procedure checks 
all its possible sub-paths of length 3 or less; in particular, it performs (p - i + 
1) checks at every level i of the filter. In most cases, BBFs outperform DBFs for 
small sizes. However, DBFs perform better for a special type of queries. Assume
an XML tree with the paths /a/b/c and /a/f/1. A BBF would falsely match the
path /a/b/1. However, DBFs would check all its possible sub-paths, /a/b/1,
/a/b, and /b/1, and return a miss for the last one. This
is confirmed by our experiments that show DBFs to outperform BBFs for such 
query workloads. 
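
The /a/b/1 example can be reproduced with the following sketch. Exact Python sets stand in for the per-level Bloom filters, so hashing and its false positives are left out, and only root-anchored queries are handled; both simplifications, and all names, are assumptions of the sketch.

```python
def bbf_levels(paths):
    """Breadth: level i holds the elements found at depth i of the tree."""
    levels = {}
    for path in paths:
        for depth, element in enumerate(path):
            levels.setdefault(depth, set()).add(element)
    return levels


def bbf_match(levels, query):
    """Each query element must appear at the corresponding depth."""
    return all(el in levels.get(depth, set()) for depth, el in enumerate(query))


def dbf_levels(paths, max_len=3):
    """Depth: level i holds all sub-paths of length i (up to max_len)."""
    levels = {i: set() for i in range(1, max_len + 1)}
    for path in paths:
        for i in range(1, max_len + 1):
            for start in range(len(path) - i + 1):
                levels[i].add(tuple(path[start:start + i]))
    return levels


def dbf_match(levels, query, max_len=3):
    """Check the (p - i + 1) sub-paths of length i at every level i."""
    for i in range(1, min(len(query), max_len) + 1):
        for start in range(len(query) - i + 1):
            if tuple(query[start:start + i]) not in levels[i]:
                return False
    return True


paths = [("a", "b", "c"), ("a", "f", "1")]
query = ("a", "b", "1")
bbf_says = bbf_match(bbf_levels(paths), query)   # level-wise check passes
dbf_says = dbf_match(dbf_levels(paths), query)   # sub-path b/1 is missing
```

The Breadth variant accepts the query because a, b and 1 each occur at the right depth, while the Depth variant rejects it because the sub-path /b/1 was never inserted.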
5.2 Content-Based Organization

In this set of experiments, we focus on filter distribution. Our performance met- 
ric is the number of hops for finding matching nodes. We simulated a network of 
nodes forming hierarchies and examined its performance with and without the 
deployment of filters and for both a content and a non content-based organiza- 
tion. First, we use simple Bloom filters and queries of length 1, for simplicity. In 
the last experiment, we use multi-level Bloom filters with path queries (queries 
with length larger than 1). We use small documents and accordingly small-sized 
filters. To scale to large documents, we just have to scale up the filter as well. 
There is one document at each node, since a large XML document corresponds 
to a set of small documents with respect to the elements and path expressions 
extracted. Each query is matched by about 10% of the nodes. For the content- 
based organization, the threshold is pre-set so that we can determine the number 
of hierarchies created. Table 2 summarizes our parameters. 



Table 2. Distribution parameters

Parameter                      Default Value                       Range
# of XML documents per node    1                                   -
Total size of filter           200-800                             -
# of queries                   100                                 -
# of elements per document     10                                  -
# of levels per document       4                                   -
Length of query                1-2                                 -
Number of nodes                100-200                             20-200
Maximum number of hops         First matching node found           20-200
Out-degree of a node           2-3                                 -
Repetition between documents   Every 10% of all docs 70% similar   -
Levels of hierarchy            3-4                                 -
Matching nodes for a query     10% of # of nodes                   1-50%



Content vs. non content-based distribution. We vary the size of the net- 
work, that is, the number of participating nodes from 20 to 200. We measure 
the number of hops a query makes to find the first matching node. Figure 7 (left) 
illustrates our results. The use of filters improves query response. Without using 
filters, the hierarchical distribution performs worse than organizing the nodes in 
a linear chain (where the worst case is equal to the number of nodes), because of 
backtracking. The content-based organization outperforms the non content-based
one because, thanks to the clustering of nodes with similar content, it locates
the correct cluster (hierarchy) that contains matching documents faster. The
number of hops remains constant as the number of nodes increases, because the
number of matching nodes increases proportionally.

Fig. 7. Content vs non content-based organization: (left) finding the first matching
node and (right) percentage of matching nodes found for a given number of hops
(recall)

In the next experiment (Fig. 7(right)), we keep the size of the network fixed
to 200 nodes and vary the maximum number of hops a query makes from 20 to
200. Note that in the number of hops, the hops made during backtracking are
also included. We are interested in recall, that is, the percentage of matching 
nodes that are retrieved (over all matching nodes) for a given number of nodes 
visited. Again, the approach without filters has the worst performance, since
it finds only about 50% of the results even after 200 hops. The content-based
organization outperforms the non content-based one. After 50 hops, that is, 25% 
of all the nodes, it is able to find all matching nodes. This is because when the 
first matching node is found, the other matching nodes are located very close, 
since nodes with similar content are clustered together. 
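
The recall metric used here can be written down as a small sketch. It assumes, for simplicity, that every hop visits one node and that the order of visits is available as a list; backtracking hops would appear as repeated entries. The names are hypothetical.

```python
def recall_at(visit_order, matching, max_hops):
    """Fraction of all matching nodes that are visited within max_hops hops."""
    if not matching:
        raise ValueError("recall is undefined without matching nodes")
    visited = set(visit_order[:max_hops])
    return len(visited & set(matching)) / len(matching)
```

For instance, if two nodes match and only one of them is among the first two nodes visited, recall after two hops is 0.5.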

We now vary the number of matching nodes from 1% to 50% of the total 
number of system nodes and measure the hops for finding the first matching 
node. The network size is fixed to 100 nodes. Our results (Fig. 8(left)) show that 
for a small number of matching nodes, the content-based organization outperforms
the other approaches by an even wider margin. The reason is that it locates the
cluster with the correct answers more easily. As the number of results increases,
both the network proximity and the filter-less approaches work well, as it becomes
more probable that they will find a matching node closer to the query’s origin,
since the documents are disseminated randomly.




[Figure 8 x-axes: (left) percentage of matching nodes; (right) maximum number of hops]

Fig. 8. (left) Number of hops to find the first result with varying number of matching
nodes and (right) recall with multi-level filters
Using multi-level filters. We repeated the previous experiments using multi-
level filters [24] and path queries of length 2. Our results confirm that multi-level
Bloom filters perform better than simple Bloom filters in the case of path queries 
and for both a content and a non content-based organization. Figure 8(right) 
reports recall while varying the maximum number of hops. 



5.3 Updates 

In this set of experiments, we compare the performance of the CountSum and 
BitSum update propagation methods. We again simulated a network of nodes 
forming hierarchies and use Bloom filters for query routing. We used two met- 
rics to compare the two algorithms: the number and size of messages. Each node 
stores 5 documents and an update operation consists of the deletion of a docu- 
ment and the insertion of a new document in its place. The deleted document 
is 0% similar to the inserted document to inflict the largest change possible to 
the filter. Again, we use small documents and correspondingly small sizes for the 
filters. The origin of the update is selected randomly among the nodes of the 
system. Table 3 summarizes the parameters used. 



Table 3. Additional update propagation parameters

Parameter                                          Default Value   Range
# of XML documents per node                        5               -
Total size of filter                               4000            -
# of updates                                       100             -
Number of nodes                                    200             20-200
Repetition between deleted and inserted document   0%              -




Number and average size of messages. We vary the size of the network from
20 to 200 nodes. We use both a content-based and a non content-based
organization and simple Bloom Filters. The BitSum method outperforms CountSum
both in message complexity and average size of messages (Fig. 9). The decrease in
the number of messages is not very significant; however, the size of the messages
is reduced by half. In particular, CountSum creates messages with a constant size,
while BitSum reduces the size of the message at every step of the algorithm.
With a content-based organization, the number of messages increases with
respect to the non content-based organization. This is because the content-based
organization results in the creation of a larger number of more unbalanced
hierarchies. However, both organizations are able to scale to a large number of
nodes, since the hierarchical distribution of the filters enables performing
updates locally. Thus, even for the content-based organization, fewer than 10% of
the system nodes are affected by an update.
[Figure 9 x-axis, both panels: number of nodes]

Fig. 9. BitSum vs CountSum update propagation: (left) number of messages (right)
average message size



Using multi-level Bloom filters. We repeat the previous experiment using 
multi-level Bloom filters as summaries. We use only the BitSum method that 
outperforms the CountSum method as shown by the previous experiment. We 
used Breadth (BBFs) and Depth Bloom Filters (DBFs), both for a content and 
a non content-based organization. The nodes vary from 20 to 200. The results 
(Fig. 10) show that BitSum also works well with multi-level Bloom filters. The
content-based organization requires a larger number of messages because of the
larger number of hierarchies created. DBFs create larger messages because more
bits are affected by an update. However, with the use of BitSum, DBFs scale
and create update messages of about 300 bytes (while for CountSum, the size
is 1K).





[Figure 10 legend: BBF, content; BBF, non content; DBF, content; DBF, non content.
x-axis, both panels: number of nodes]

Fig. 10. BitSum update propagation with multi-level Bloom filters: (left) number of
messages (right) average message size



6 Related Work 

We briefly compare our work with related approaches regarding XML indexes
and the use of Bloom filters for query routing. A more thorough comparison can
be found in [24]. Various methods for indexing XML documents (such as
DataGuides [9], Patricia trees [10], XSKETCH [11] and signatures [12]) provide
efficient ways of summarizing XML data, support complex path queries and offer 
selectivity estimations. However, these structures are centralized and the
emphasis is on space efficiency and I/O costs. In contrast, in a P2P context, we
are interested in small-size summaries of large collections of XML documents that
can be used to filter out irrelevant nodes fast, with the additional requirement
that such summaries can be distributed efficiently. Finally, when compared to
Bloom filters, merging and updating of path indexes is more complicated.

Perhaps the resource discovery protocol most related to our approach is the 
one in [5] that uses simple Bloom filters as summaries. Servers are organized into 
a hierarchy modified according to the query workload to achieve load balance. 
Local and merged Bloom filters are also used in [14], but the nodes follow no
particular structure. The merged filters include information about all nodes of the
system and thus scalability issues arise. In both of the above cases, Bloom filters
were used for keyword queries and not for XML data, whereas our work supports
path queries. Furthermore, the use of filters was limited to query routing,
while we extend their use to build content-based overlay networks.

More recent research presents content-based distribution in P2P where nodes 
are “clustered” according to their content. With Semantic Overlay Networks 
(SONs) [15], nodes with semantically similar content are grouped based on a 
classification hierarchy of their documents. Queries are processed by identifying
which SONs are better suited to answer them. However, there is no description
of how queries are routed or how the clusters are created, and no use of filters
or indexes. A schema-based (RDF-based) peer-to-peer network is presented in
[16]. The system can support heterogeneous metadata schemes and ontologies,
but it requires a strict topology with hypercubes and the use of super-peers,
limiting the dynamic nature of the network.



7 Conclusions and Future Work 

In this paper, we study the problem of routing path queries in P2P systems of 
nodes that store XML documents. We introduce two new hash-based indexing 
structures, the Breadth and Depth Bloom Filters, which, in contrast to
traditional hash-based indexes, have the ability to represent path expressions and
thus exploit the structure of XML documents. Our experiments show that both
structures outperform a same-size simple Bloom Filter. In particular, for only
2% of the total size of the documents, multi-level Bloom filters can provide
efficient evaluation of path queries with a false positive ratio below 3%. In
general, Breadth Bloom filters work better than Depth Bloom filters; however, Depth
Bloom filters recognize a special type of path queries. In addition, we introduce
BitSum, an efficient update propagation method that significantly reduces the 
size of the update messages. Finally, we present a hierarchical organization that 
groups together nodes with similar content to improve search efficiency. Con- 
tent similarity is related to similarity among filters. Our performance results 
confirm that an organization that performs a type of content clustering is much 
more efficient when we are interested in retrieving a large number of relevant
documents. 

An interesting issue for future work is deriving a method for self-organizing 
the nodes by adjusting the threshold of the hierarchies. Other important top- 
ics include alternative ways for distributing the filters besides the hierarchical 
organization and using other types of summaries instead of Bloom filters. 
Abstract. A location-based service (LBS) provides information based
on the location information specified in a query. Nearest-neighbor 
(NN) search is an important class of queries supported in LBSs. This 
paper studies energy-conserving air indexes for NN search in a wireless 
broadcast environment. Linear access requirement of wireless broadcast 
weakens the performance of existing search algorithms designed for 
traditional spatial databases. In this paper, we propose a new energy-conserving
index, called the grid-partition index, which enables a single
linear scan of the index for any NN query. The idea is to partition
the search space for NN queries into grid cells and index all the objects 
that are potential nearest neighbors of a query point in each grid 
cell. Three grid partition schemes are proposed for the grid-partition 
index. Performance of the proposed grid-partition indexes and two 
representative traditional indexes (enhanced for wireless broadcast) is 
evaluated using both synthetic and real data. The result shows that the 
grid-partition index substantially outperforms the traditional indexes. 

Keywords: mobile computing, location-based services, energy-conserving index,
nearest-neighbor search, wireless broadcast



1 Introduction 

Due to the popularity of personal digital devices and advances in wireless
communication technologies, location-based services (LBSs) have received a lot of
attention from both the industrial and academic communities [9]. In its report
“IT Roadmap to a Geospatial Future” [14], the Computer Science and Telecom- 
munications Board (CSTB) predicted that LBS will usher in the era of pervasive 
computing and reshape mass media, marketing, and various aspects of our so- 
ciety in the decade to come. With the maturation of necessary technologies and 
the anticipated worldwide deployment of 3G wireless communication
infrastructure, LBSs are expected to be one of the killer applications for the
wireless data industry.

In the wireless environments, there are basically two approaches for provision 
of LBSs to mobile users: 1 

— On-Demand Access: A mobile user submits a request, which consists of a 
query and its current location, to the server. The server locates the requested 
data and returns it to the mobile user. 

— Broadcast: Data are periodically broadcast on a wireless channel open to 
the public. Instead of submitting a request to the server, a mobile user tunes 
into the broadcast channel and filters out the data based on the query and 
its current location. 

On-demand access employs a basic client-server model where the server is
responsible for processing a query and returning the result directly to the user via
a dedicated point-to-point channel. On-demand access is particularly suitable for
lightly loaded systems, when contention for wireless channels and server processing
is not severe. However, as the number of users increases, the system performance
deteriorates rapidly. On the other hand, wireless broadcast, which has long been
used in the radio and TV industry, is a natural solution to the scalability
and bandwidth problems in pervasive computing environments, since broadcast
data can be shared by many clients simultaneously. For many years, companies
such as Hughes Network System have been using satellite-based broadcast
to provide broadband services. The smart personal objects technology (SPOT),
announced by Microsoft at the 2003 International Consumer Electronics Show,
has further confirmed the industrial interest in, and feasibility of, utilizing
wireless broadcast for pervasive data services. With a continuous broadcast network
(called Direct-Band Network) using FM radio subcarrier frequencies, SPOT-based
devices such as watches, alarms, etc., can continuously receive timely,
location-specific, personalized information [5]. Thus, in this paper, we focus on
supporting LBSs in wireless broadcast systems.

A very important class of problems in LBSs is nearest-neighbor (NN) search.
An example of an NN search is: “Show me the nearest restaurant.” A lot of
research has been carried out on how to solve the NN search problem for spatial
databases [13]. Most of the existing studies on NN search are based on indexes
that store the locations of the data objects (e.g., the well-known R-tree [13]).
We call them object-based indexes. Recently, Berchtold et al. proposed a method
for NN search based on indexing the pre-computed solution space [1]. Based
on a similar design principle, a new index, called D-tree, was proposed by
the authors [15]. We refer to this category of indexes as solution-based indexes.
Both object-based indexes and solution-based indexes have some advantages
and disadvantages. For example, object-based indexes have a small index size, 
but they sometimes require backtracking to obtain the result. This only works
for random data access media such as disks but, as shown later in this paper,
does not perform well on broadcast data. Solution-based indexes overcome the
backtracking problem and thus work well for both random and sequential data
access media. However, they in general perform well in high-dimensional space
but poorly in low-dimensional space, since the solution space generally consists
of many irregular shapes to index.

1 In this paper, mobile users, mobile clients, and mobile devices are used
interchangeably.

The goal of this study is to design a new index tailored for supporting NN 
search on wireless broadcast channels. Thus, there are several basic requirements 
for such a design: 1) The index can facilitate energy saving at mobile clients; 2) 
The index is access and storage efficient (because the index will be broadcast 
along with the data); 3) The index is flexible (i.e., tunable based on a weight
between energy saving and access latency); and 4) A query can be answered within
one linear scan of the index. Based on the above design principles, we propose a
new energy-conserving index called the Grid-Partition Index, which combines
the strengths of object-based indexes and solution-based indexes in a novel way.
Algorithms for constructing the grid-partition index and processing NN queries on
a wireless broadcast channel based on the proposed index are developed.

The rest of this paper is organized as follows. Section 2 introduces air indexing 
for a wireless broadcast environment and reviews existing index structures for 
NN search. Section 3 explains the proposed energy-conserving index, followed by
a description of three grid partition schemes in Section 4. Performance evaluation
of the Grid-Partition index and two traditional indexes is presented in Section 5. 
Finally, we conclude the paper with a brief discussion on the future work in 
Section 6. 

2 Background 

This study focuses on supporting the NN search in wireless broadcast environ- 
ments, in which the clients are responsible for retrieving data by listening to the 
wireless channel. In the following, we review the air indexing techniques for wire- 
less data broadcast and the existing index structures for NN search. Throughout 
this paper, the Euclidean distance function is assumed. 

2.1 Air Indexing Techniques for Wireless Data Broadcast 

One critical issue for mobile devices is the consumption of battery power [3,8,11].
It is well known that transmitting a message consumes much more battery power
than receiving a message. Thus, data access via a broadcast channel is more energy
efficient than on-demand access. However, if only the data objects are broadcast,
a mobile device may have to receive many redundant data objects on air before
it finds the answer to its query. With increasing emphasis on and rapid development
of energy-conserving functionality, mobile devices can switch between doze mode
and active mode in order to conserve energy. Thus, air indexing
techniques, aiming at energy conservation, are developed by pre-computing and
indexing certain auxiliary information (i.e., the arrival time of data objects)
for broadcasting along with the data objects [8]. By first examining the index
information on air, a mobile client is able to predict the arrival time of the
desired data objects. Thus, it can stay in the doze mode most of the time to
save energy and wake up only in time to tune into the broadcast channel when the
requested data objects arrive.

Fig. 1. Data and Index Organization Using the (1, m) Interleaving Technique

A lot of existing research focuses on organizing the index information with
data objects in order to improve the service efficiency. A well-known data and
index organization for wireless data broadcast is the (1, m) interleaving
technique [8]. As shown in Figure 1, a complete index is broadcast preceding every
1/m fraction of the broadcast cycle, i.e., the period of time in which the
complete set of data objects is broadcast. By replicating the index m times, the
waiting time for a mobile device to access the forthcoming index can be reduced.
Note that this interleaving technique can be applied to any index
designed for wireless data broadcast. Thus, in this paper, we employ the (1, m)
interleaving scheme for our index.
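
As an illustration of the (1, m) layout, the following minimal Python sketch
builds one broadcast cycle and estimates how long a client waits for the next
index copy. The function names and the packet-count units are assumptions made
for this example, and the waiting-time estimate is a simple average over a
uniformly random tune-in time rather than a result taken from this paper.

```python
def broadcast_cycle(index_packets, data_packets, m):
    """Lay out one cycle: a full index copy precedes each 1/m of the data."""
    chunk = -(-len(data_packets) // m)  # ceiling division
    cycle = []
    for k in range(m):
        part = data_packets[k * chunk:(k + 1) * chunk]
        if part:  # skip empty segments when m exceeds the number of data packets
            cycle.extend(index_packets)
            cycle.extend(part)
    return cycle


def expected_wait_for_index(index_len, data_len, m):
    """Average wait (in packets) until the next index copy starts.

    Each segment holds one index copy plus 1/m of the data; a client tuning in
    at a uniformly random time waits half a segment on average.
    """
    return 0.5 * (index_len + data_len / m)


if __name__ == "__main__":
    print(broadcast_cycle(["I"], ["d1", "d2", "d3", "d4"], 2))
    print(expected_wait_for_index(10, 100, 1), expected_wait_for_index(10, 100, 4))
```

A larger m shortens the wait for the next index copy but lengthens the whole
cycle, since the index is transmitted m times.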

Two performance metrics are typically used for the evaluation of air indexing
techniques: tuning time and access latency. The former is the period of
time a client stays in the active mode, including the time used for searching
the index and the time used for downloading the requested data. Since the
downloading time of the requested data is the same for any indexing scheme, we
only consider the tuning time used for searching the index. This metric roughly
measures the power consumption of a mobile device. To provide a more precise
evaluation, we also use power consumption as a metric in our evaluation. The
latter is the period of time from the moment a query is issued until the
moment the query result is received by a mobile device.



2.2 Indexes for NN Search 

There is a lot of existing work on answering NN search in traditional spatial
databases. As mentioned earlier, existing indexing techniques for NN search can
be categorized into object-based indexes and solution-based indexes. Figure 2(a)
depicts an example with four objects, $o_1$, $o_2$, $o_3$, and $o_4$, in a search
space A. This running example illustrates the different indexing techniques
discussed throughout this paper.

Fig. 2. A Running Example and R-tree Index

Object-Based Index. The indexes in this category are built upon the locations
of data objects. R-tree is a representative [7]. Figure 2(c) shows the R-tree
for the running example. To perform NN search, a branch-and-bound approach
is employed to traverse the index tree. At each step, heuristics are employed
to choose the next branch for traversal. At the same time, information is
collected to prune the future search space. Various search algorithms differ in
the searching order and the metrics used to prune the branches [4,13].

Backtracking is commonly used in search algorithms proposed for the traditional
disk-access environment. However, this practice causes a problem for
linear-access broadcast channels. In wireless broadcast environments, index
information is available to the mobile devices only when it is on the air. Hence,
when an algorithm retrieves the index packets in an order different from their
broadcast sequence, it has to wait for the next time the packet is broadcast (see
the next paragraph for details). In contrast, the index for traditional databases
is stored in resident storage, such as memory and disks. Consequently, it is
available at any time.

Since the linear access requirement is not a design concern of traditional index
structures, the existing algorithms do not meet the requirement of energy
efficiency. For example, the index tree in Figure 2(c) is broadcast in the
sequence of root, $R_1$, and $R_2$. Given a query point $p_2$, the visit sequence
(first root, then $R_2$, finally $R_1$) results in a large access latency, as
shown in Figure 3(a). Therefore, the branch-and-bound search approach is
inefficient in access latency. Alternatively, we may just access the MBRs
sequentially (see Figure 3(b)). However, this method is not the best in terms of
index search performance since unnecessary MBR traversals may be incurred. For
example, accessing $R_1$ for $q_1$ is a waste of energy.

Fig. 3. Linear Access on a Wireless Broadcast Channel

Solution-Based Index. The indexes in this category are built on the
pre-computed solution space, rather than on the objects [1]. For NN search, the
solution space can be represented by Voronoi Diagrams (VDs) [2]. Let
$O = \{o_1, o_2, \ldots, o_n\}$ be a set of points. $V(o_i)$, the Voronoi Cell (VC)
for $o_i$, is defined as the set of points $q$ in the space such that
$dist(q, o_i) < dist(q, o_j), \forall j \neq i$. The VD for the running example is
depicted in Figure 4(a), where $P_1$, $P_2$, $P_3$, and $P_4$ denote the VCs for
the four objects, $o_1$, $o_2$, $o_3$, and $o_4$, respectively.
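
The Voronoi cell definition can be read as a membership test: a query point
lies in the cell of whichever object is closest to it. The short Python sketch
below illustrates this under the Euclidean distance assumed in this paper; the
function name and the sample coordinates are hypothetical.

```python
import math


def voronoi_owner(q, objects):
    """Return the index i of the object whose Voronoi cell contains q.

    Ties (points on a cell boundary) are resolved in favor of the lowest index.
    """
    return min(range(len(objects)), key=lambda i: math.dist(q, objects[i]))


if __name__ == "__main__":
    objects = [(0, 0), (10, 0), (0, 10), (10, 10)]  # hypothetical o_1..o_4
    print(voronoi_owner((1, 2), objects))
    print(voronoi_owner((9, 8), objects))
```

This brute-force test scans every object, which is exactly the cost that a
solution-based index tries to avoid by indexing the cells themselves.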

With a solution-based index, the NN search problem can be reduced to the
problem of determining the VC in which a query point is located. Our previously
proposed index, D-tree, has demonstrated better performance for indexing the
solution space than traditional indexes, and hence is employed as a
representative of indexes in this category [15]. D-tree indexes VCs based on the
divisions that form the boundaries of the VCs. For a space containing a set of
VCs, D-tree recursively partitions it into two sub-spaces having a similar number
of VCs until each space contains only one VC². The D-tree for the running example
is shown in Figure 4(b).

In summary, existing index techniques for NN search are not suitable for
wireless data broadcast. An object-based index incurs a small index size, but the
tuning time is poor because random data access is not allowed in a broadcast
channel. On the other hand, a solution-based index, typically used for NN search
in a high-dimensional space, does not perform well in a low-dimensional space
because efficient structures for indexing VCs are not available. In the
following, we propose a new energy-conserving index that combines the strengths
of both the object-based and the solution-based indexes.




Fig. 4. D-tree Index for the Running Example: (a) Divisions, (b) D-tree



3 A Grid-Partition Index 

In this section, we first introduce the basic idea of our proposal and then describe 
the algorithm for processing NN search based on the Grid-Partition air index. 

² D-tree was proposed to index any pre-computed solution space, not only for NN
search.

3.1 The Basic Idea

In an object-based index such as R-tree, each NN search starts with the whole
search space and gradually trims the space based on knowledge collected
during the search process. We observe that an essential problem leading to the
poor search performance of object-based indexes is the large overall search space.
Therefore, we attempt to reduce the search space for a query at the very
beginning by partitioning the space into disjoint grid cells. To do so, we first
construct the whole solution space for NN search using the VD method; then, we
divide the search space into disjoint grid cells using some grid partition
algorithm (three partition schemes will be discussed in Section 4). For each grid
cell, we index all the objects that are potential nearest neighbors of a query
point inside the grid cell. Each object is the nearest neighbor only to those
query points located inside the VC of the object. Hence, for any query point
inside a grid cell, only the objects whose VCs overlap with the grid cell need to
be checked.

Definition 1 An object is associated with a grid cell if the VC of the object 
overlaps with the grid cell. 

Since each grid cell covers only a part of the search space, the number of
objects associated with each grid cell is expected to be much smaller than the
total number of objects in the original space. Thus, the initial search space for
a NN query is reduced greatly if we can quickly locate the grid cell in which a
query point lies. Hence, the overall performance is improved. Figure 5(a) shows
a possible grid partition for our running example. The whole space is divided
into four grid cells, i.e., $G_1$, $G_2$, $G_3$, and $G_4$. Grid cell $G_1$ is
associated with objects $o_1$ and $o_2$ since their VCs, $P_1$ and $P_2$, overlap
with $G_1$; likewise, grid cell $G_2$ is associated with objects $o_1$, $o_2$, and
$o_3$, and so on.
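
Definition 1 can be sketched in code as an overlap test between each Voronoi
cell and each grid cell. The Python sketch below assumes that every VC is
available as a convex polygon (a list of vertices) and that grid cells are
axis-aligned rectangles; it uses a separating-axis test, which is one common
way to check convex overlap and is not necessarily the method used by the
authors. All names and sample data are hypothetical.

```python
def convex_overlap(poly, rect):
    """True if a convex polygon and a rectangle share interior points."""
    xmin, ymin, xmax, ymax = rect
    corners = [(xmin, ymin), (xmax, ymin), (xmax, ymax), (xmin, ymax)]
    axes = [(1.0, 0.0), (0.0, 1.0)]  # rectangle edge normals
    for i in range(len(poly)):
        (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % len(poly)]
        axes.append((y1 - y2, x2 - x1))  # polygon edge normal
    for ax, ay in axes:
        p = [ax * x + ay * y for x, y in poly]
        r = [ax * x + ay * y for x, y in corners]
        if max(p) <= min(r) or max(r) <= min(p):
            return False  # a separating axis exists (touching counts as disjoint)
    return True


def associated_objects(voronoi_cells, grid):
    """Map each grid cell id to the objects whose VCs overlap it."""
    return {
        g: [o for o, poly in voronoi_cells.items() if convex_overlap(poly, rect)]
        for g, rect in grid.items()
    }


if __name__ == "__main__":
    cells = {
        "o1": [(0, 0), (5, 0), (5, 10), (0, 10)],
        "o2": [(5, 0), (10, 0), (10, 10), (5, 10)],
    }
    grid = {"G1": (0, 0, 4, 10), "G2": (4, 0, 10, 10)}
    print(associated_objects(cells, grid))
```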

The index structure for the proposed grid-partition index consists of two
levels. The upper-level index is built upon the grid cells, and the lower-level
index is built upon the objects associated with each grid cell. The upper-level
index maps a query point to the corresponding grid cell, while the lower-level
index facilitates access to the objects within each grid cell. Once the query
point is located in a grid cell, its nearest neighbor is guaranteed to be
among the objects associated with that grid cell, which prevents any rollback
operations and enables a single linear access of the upper-level index for any
query point. In addition, to avoid rollback operations in the lower-level index,
we try to keep each grid cell at a size such that its associated objects fit into
one data packet, which is the smallest transmission unit in wireless broadcast.
Thus, for each grid cell a simple index structure (i.e., a list of object and
pointer pairs) is employed. If the index for a grid cell cannot fit into one
packet, a number of packets are sequentially allocated. In each grid cell, the
list of object and pointer pairs is sorted according to the dimension with the
largest scale (hereafter called the sorting dimension), i.e., the dimension in
which the grid cell has the largest span. For example, in Figure 5(a), the
associated objects for grid cell $G_2$ are sorted according to the y-dimension,
i.e., $o_1$, $o_3$, and $o_2$. This arrangement speeds up the nearest-neighbor
detection procedure, as we will see in the next subsection.

Fig. 5. Fixed Grid Partition for the Running Example: (a) FP, (b) Index Structure

3.2 Nearest-Neighbor Search 

With a grid-partition index, a NN query is answered by executing the following
three steps: 1) locating the grid cell, 2) detecting the nearest neighbor, and
3) retrieving the data. The first step locates the grid cell in which the query
point lies. The second step obtains all the objects associated with that grid
cell and detects the nearest neighbor by comparing their distances to the query
point. The final step retrieves the data to answer the query. In the following,
we describe an efficient algorithm for detecting the nearest neighbor in a grid
cell. This algorithm works for all the proposed grid partition schemes. We leave
the issue of locating the grid cell to Section 4, which allows us to treat the
problems of partitioning the grid and locating grid cells more coherently.

In a grid cell, given a query point, the sorted objects are broken into two
lists according to the query point in the sorting dimension: one list consists of
the objects with coordinates smaller than the query point, and the rest form the
other. To detect the nearest neighbor, the objects in those two lists are checked
alternately. Initially, the current shortest distance min_dis is set to infinity.
At each checking step, min_dis is updated if the distance between the object
being checked and the query point, cur_dis, is shorter than min_dis. The checking
process continues until the distance between the current object and the query
point in the sorting dimension, dis_sd, is longer than min_dis. The correctness
is justified as follows. For the current object, its cur_dis is longer than or
equal to dis_sd and, hence, longer than min_dis if dis_sd is longer than min_dis.
For the remaining objects in the list, their dis_sd values are even longer and,
thus, it is impossible for them to have a distance shorter than min_dis.
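
The detection procedure above can be sketched in Python as follows. This is a
minimal illustration under stated assumptions: objects are plain coordinate
tuples already sorted on the sorting dimension, and the function name, argument
layout, and sample data are hypothetical rather than taken from the paper.

```python
import math
from bisect import bisect_left


def detect_nn(objects, q, dim):
    """Find the nearest neighbor of q among objects sorted ascending on `dim`."""
    keys = [o[dim] for o in objects]
    split = bisect_left(keys, q[dim])
    # Two lists, each ordered by increasing distance from q in the sorting dimension.
    lists = [iter(objects[:split][::-1]), iter(objects[split:])]
    active = [True, True]
    min_dis, best = math.inf, None
    while any(active):
        for k in (0, 1):  # check the two lists alternately
            if not active[k]:
                continue
            obj = next(lists[k], None)
            if obj is None:
                active[k] = False  # list exhausted
                continue
            dis_sd = abs(obj[dim] - q[dim])
            if dis_sd > min_dis:
                active[k] = False  # later objects in this list are even farther
                continue
            cur_dis = math.dist(obj, q)
            if cur_dis < min_dis:
                min_dis, best = cur_dis, obj
    return best, min_dis


if __name__ == "__main__":
    objs = sorted([(1, 5), (2, 1), (4, 4), (6, 5), (7, 9), (9, 4)])
    print(detect_nn(objs, (5, 4), dim=0))
```

In the sample call only two full distance computations are needed; the other
four objects are pruned from their coordinate in the sorting dimension alone.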

Figure 6(a) illustrates an example, where nine objects associated with the
grid cell are sorted according to the x-dimension since the grid cell is flat.
Given the query point shown in the figure, the nine objects are broken into two
lists, with one containing $o_6$ to $o_1$ and the other containing $o_7$ to $o_9$.
The algorithm proceeds to check these two lists alternately, i.e., in the order
of $o_6$, $o_7$, $o_5$, $o_8$, and so on. Figure 6(b) shows the intermediate
results for each step. In the first list, the checking stops at $o_4$ since its
dis_sd (i.e., 7.5) is already longer than min_dis (i.e., 6). Similarly, the
checking stops at $o_8$ in the second list. As a result, only five objects rather
than all nine objects are evaluated. Such improvement is expected to be
significant when the scales of a grid cell in different dimensions differ greatly.

Fig. 6. An Example for Detecting Nearest Neighbor

4 Grid Partitions 

Thus far, the problem of NN search has been reduced to the problem of grid 
partition. How to divide the search space into grid cells, construct the upper-level 
grid index, and map a query point into a grid cell are crucial to the performance 
of the proposed index. In this section, three grid partition schemes, namely, 
fixed partition, semi-adaptive partition, and adaptive partition, are introduced. 
These schemes are illustrated in a two-dimensional space, since we focus on the 
geospatial world (2-D or 3-D space) in the real mobile environments. 

Before presenting the grid partition algorithms, we first introduce an important
performance metric, indexing efficiency $\eta$, which is employed in some of the
proposed grid partition schemes. It is defined as the ratio of the reduced
tuning time to the enlarged index storage cost against a naive scheme, where the
locations of objects are stored as a plain index that is exhaustively searched
to answer a NN query. The indexing efficiency of a scheme $i$ is defined as

$$\eta(i) = \frac{\left((T_{naive} - T_i)/T_{naive}\right)^{\alpha}}{(S_i - S_{naive})/S_{naive}},$$

where $T$ is the average tuning time, $S$ is the index storage cost, and $\alpha$
is a control parameter to weigh the importance of the saved tuning time and the
index storage overhead. The setting of $\alpha$ can be adjusted for different
application scenarios. The larger the $\alpha$ value, the more important the
tuning time compared with the index storage cost. This metric will be used as a
performance guideline to balance the tradeoff between the tuning time and the
index overhead in constructing the grid-partition index.
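
As a small worked illustration, the following Python sketch evaluates the
indexing efficiency under the assumption that the definition reads
$\eta(i) = ((T_{naive} - T_i)/T_{naive})^{\alpha} / ((S_i - S_{naive})/S_{naive})$,
i.e., with the exponent $\alpha$ applied to the relative tuning-time saving.
The function name and the sample numbers are hypothetical.

```python
def indexing_efficiency(t_i, s_i, t_naive, s_naive, alpha=1.0):
    """Relative tuning-time saving (raised to alpha) per unit of relative extra storage.

    Assumes t_i <= t_naive and s_i > s_naive; otherwise the ratio is undefined
    or not meaningful.
    """
    saved_time = (t_naive - t_i) / t_naive
    extra_storage = (s_i - s_naive) / s_naive
    return saved_time ** alpha / extra_storage


if __name__ == "__main__":
    # Hypothetical scheme: tuning time drops from 10 to 2, storage grows from 100 to 150.
    for alpha in (0.0, 1.0, 2.0):
        print(alpha, indexing_efficiency(2.0, 150.0, 10.0, 100.0, alpha))
```

With alpha set to 0 the numerator becomes 1, so only the storage overhead
matters; larger alpha values make differences in tuning-time saving dominate
the comparison between candidate partitions.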

4.1 Fixed Partition (FP) 

A simple way for grid partition is to divide the search space into fixed-size grid
cells. Let $S_x$ and $S_y$ be the scales of the x- and y-dimensions in the original
space, and let $g_x$ and $g_y$ be the fixed width and height of a grid cell. The
original space is thus divided into $\frac{S_x}{g_x} \cdot \frac{S_y}{g_y}$ grid
cells. With this approach, the upper-level index for the grid cells (shown in
Figure 5(b)) maintains some header information (i.e., $S_x$, $g_x$, and $g_y$) to
assist in locating grid cells, along with a one-dimensional array that stores the
pointers to the grid cells. In the data structure, if the header information and
the pointer array cannot fit into one packet, they are allocated in a number of
sequential packets.

The grid cell locating procedure works as follows. We first access the header
information and get the parameters $S_x$, $g_x$, and $g_y$. Then, given a query
point $(q_x, q_y)$, we use a mapping function,
$adr(q_x, q_y) = \lfloor q_y/g_y \rfloor \cdot \frac{S_x}{g_x} + \lfloor q_x/g_x \rfloor$,
to calculate the address of the pointer for the grid cell in which the query
point lies. Hence, at most two packet accesses (one for the header information
and possibly an additional one for the pointer if it is not allocated in the same
packet) are needed to locate a grid cell, regardless of the number of grid cells
and the packet size.
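
The locating step for the fixed partition amounts to one arithmetic mapping
from a query point to an array slot. The Python sketch below assumes a
row-major layout of the pointer array, i.e.,
$adr(q_x, q_y) = \lfloor q_y/g_y \rfloor \cdot (S_x/g_x) + \lfloor q_x/g_x \rfloor$,
which is consistent with a header that stores only $S_x$, $g_x$, and $g_y$.
The function name and sample values are hypothetical.

```python
import math


def fp_address(qx, qy, sx, gx, gy):
    """Index into the one-dimensional pointer array for the cell containing (qx, qy)."""
    cols = math.ceil(sx / gx)  # number of grid cells per row
    return int(qy // gy) * cols + int(qx // gx)


if __name__ == "__main__":
    # Hypothetical 100 x 100 space with 25 x 50 cells: 4 columns, 2 rows.
    print(fp_address(30, 60, sx=100, gx=25, gy=50))
```

Because the mapping needs no search, the cost of locating a cell does not grow
with the number of cells, which matches the two-packet bound stated above.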

Aiming to maximize the packet utilization in the index, we employ a greedy
algorithm to choose the best grid size. Let num be the number of expected
grid cells. We continue to increase num from 1 until the average number of
objects associated with the grid cells is smaller than the fan-out of a node.
Further increasing num would decrease the packet occupancy and thus degrade
the performance. For any num, every possible combination of $g_x$ and $g_y$ such
that $\frac{S_x}{g_x} \cdot \frac{S_y}{g_y}$ equals num is considered. The indexing
efficiency for the resultant grid partition with width $g_x$ and height $g_y$ is
calculated. The grid partition achieving the highest indexing efficiency is
selected as the final solution.

While the fixed grid partition is simple, it does not take into account the 
distribution of objects and their VCs. Thus, it is not easy to utilize the index 
packets efficiently, especially under a skewed object distribution. Consequently, 
under the fixed grid partition, it is not unusual to have some packets with a low 
utilization rate, whereas some others overflow. This could lead to a poor average 
performance. 

4.2 Semi-Adaptive Partition (SAP)

To adapt to skewed object distributions, the semi-adaptive partition only fixes
the size of the grid cells in either width or height. In other words, the whole space
is equally divided into stripes along one dimension. In the other dimension, each
stripe is partitioned into grid cells in accordance with the object distribution.

Fig. 7. Semi-Adaptive Partition for the Running Example (panel (b): Index Structure)



The objective is to increase the utilization of a packet by having the number 
of objects associated with each grid cell close to the fan-out of the node. Thus, 
once the grid cell is identified for a query point, only one packet needs to be 
accessed in the step of detecting nearest neighbor. 

Figure 7 illustrates the semi-adaptive grid partition for our running example,
where the height of each stripe is fixed. Similar to the FP approach, the root
records the width of a stripe (i.e., $s_x$) for the mapping function and an array
of pointers pointing to the stripes. In each stripe, if the associated objects
can fit into one packet, the objects are allocated directly in the lower-level
index (e.g., the 1st and 4th pointers in the root). Otherwise, an extra index node
for the grid cells within the corresponding stripe is allocated (e.g., the 3rd
pointer in the root). The extra index node consists of a set of sorted
discriminators followed by the pointers pointing to the grid cells. However, if
there is no way to further partition a grid cell such that the objects in each
grid cell fit the packet capacity, more than one packet is allocated (e.g., the
2nd pointer in the root).

To locate the grid cell for a query point $(q_x, q_y)$, the algorithm first
locates the desired stripe using a mapping function,
$adr(q_x, q_y) = \lfloor q_x/s_x \rfloor$. If the stripe points to an object packet
(i.e., it contains only one grid cell), the search is finished. Otherwise,
we traverse to the extra index node and use the discriminators to locate the
appropriate grid cell. Compared with FP, this partition approach has a better
packet occupancy, but takes more space to index the grid cells.
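
The two-step lookup for the semi-adaptive partition can be sketched as follows.
This is a minimal Python illustration that assumes stripes of fixed width
$s_x$ along the x-dimension and, within a stripe, sorted discriminators on the
y-coordinate; the data layout (a list of `(discriminators, cells)` pairs) and
all names are hypothetical.

```python
from bisect import bisect_right


def sap_locate(qx, qy, stripe_width, stripes):
    """Return the grid cell containing (qx, qy).

    stripes[k] is a pair (discriminators, cells), where `discriminators` is a
    sorted list of len(cells) - 1 boundaries. A stripe with a single grid cell
    has an empty discriminator list.
    """
    discriminators, cells = stripes[int(qx // stripe_width)]  # step 1: mapping function
    if not discriminators:
        return cells[0]  # the stripe points directly to an object packet
    return cells[bisect_right(discriminators, qy)]  # step 2: extra index node


if __name__ == "__main__":
    stripes = [
        ([], ["G1"]),                            # stripe 0: one grid cell
        ([40.0, 70.0], ["G2", "G3", "G4"]),      # stripe 1: three grid cells
    ]
    print(sap_locate(10, 80, 50, stripes), sap_locate(60, 90, 50, stripes))
```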



4.3 Adaptive Partition (AP) 

The third scheme adaptively partitions the grid using a kd-tree-like partition
method [12]. It recursively partitions the search space into two complementary
subspaces such that the number of objects associated with the two subspaces
is nearly the same. The partition does not stop until the number of objects
associated with each subspace is smaller than the fan-out of the index node.

The partition algorithm works as follows. We partition the space horizontally
or vertically. Suppose that the vertical partition is employed. We sort
the objects in increasing order of the left-most x-coordinates (LXs) of their
VCs. Then, we examine the LXs one by one, beginning from the median object.
Given an LX, the space is partitioned by the vertical line going through
the LX. Let $num_l$ and $num_r$ be the numbers of associated objects for the left
subspace and the right subspace, respectively. If $num_l = num_r$, the examination
stops immediately. Otherwise, all the LXs are tried and the LX resulting in the
smallest value of $|num_l - num_r|$ is selected as the discriminator. Similarly,
when the horizontal partition is employed, the objects are sorted according to
the lowest y-coordinates (LYs) of their VCs and the discriminator is selected in
much the same way as in the vertical partition. In selecting the partition style
between the vertical and horizontal partitions, we favor the one with the smaller
value of $|num_l + num_r|$. Figure 8 shows the adaptive grid partition for the
running example, where each node in the upper-level index stores the
discriminator followed by two pointers pointing to the two subspaces of the
current space.

Fig. 8. Adaptive Partition for the Running Example (panel (b): Index Structure)
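
The selection of a vertical discriminator can be sketched in Python as below.
The sketch assumes each VC is summarized by its x-extent `(lx, rx)` within the
current subspace, so that a VC is associated with the left subspace when
`lx < c` and with the right subspace when `rx > c`; this summary, the function
name, and the sample extents are assumptions made for illustration.

```python
def choose_discriminator(extents):
    """Pick the LX that best balances the two subspaces.

    extents: list of (lx, rx) x-extents of the VCs in the current subspace.
    Returns (discriminator, num_l, num_r).
    """
    lxs = sorted(lx for lx, _ in extents)
    mid = len(lxs) // 2
    order = lxs[mid:] + lxs[:mid]  # begin the examination at the median object
    best = None
    for c in order:
        num_l = sum(1 for lx, _ in extents if lx < c)  # VCs reaching left of the line
        num_r = sum(1 for _, rx in extents if rx > c)  # VCs reaching right of the line
        diff = abs(num_l - num_r)
        if best is None or diff < best[0]:
            best = (diff, c, num_l, num_r)
        if diff == 0:
            break  # perfectly balanced: stop immediately
    return best[1], best[2], best[3]


if __name__ == "__main__":
    print(choose_discriminator([(0, 4), (0, 6), (3, 10), (5, 10)]))
```

Applying the same routine to y-extents gives the horizontal candidate, and the
two candidates can then be compared by `num_l + num_r` as described above.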

In this approach, the index for the grid cells is a kd-tree. Thus, the point
query algorithm for the kd-tree is used to locate the grid cells. Given a query
point, we start at the root. If it is to the left of the discriminator of the
node, the left pointer is followed; otherwise, the right pointer is followed. This
procedure continues until a leaf node is reached. However, as the kd-tree is
binary, we need some paging method to store it in a way that fits the packet size.
A top-down paging mechanism is employed. The binary kd-tree is traversed in
breadth-first order. For each new node, the packet containing its parent is
checked. If that packet has enough free space to contain this node, the node is
inserted into that packet. Otherwise, a new packet is allocated.
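
Both the point query and the top-down paging can be sketched briefly in Python.
The sketch makes several assumptions for illustration: the `Node` class and its
fields are hypothetical, packet capacity is counted in nodes rather than bytes,
and only internal (discriminator) nodes are paged, with leaves standing for the
grid cells of the lower-level index.

```python
from collections import deque


class Node:
    """Internal node (axis, disc, left, right) or leaf (cell is set)."""

    def __init__(self, axis=None, disc=None, left=None, right=None, cell=None):
        self.axis, self.disc = axis, disc
        self.left, self.right, self.cell = left, right, cell


def locate_cell(root, q):
    """Follow discriminators from the root until a leaf (grid cell) is reached."""
    node = root
    while node.cell is None:
        node = node.left if q[node.axis] < node.disc else node.right
    return node.cell


def page_top_down(root, capacity):
    """Assign internal nodes to packets in breadth-first order.

    A node joins its parent's packet if that packet has free space;
    otherwise a new packet is allocated. Returns (packet_of, load).
    """
    packet_of, load = {}, []
    queue = deque([(root, None)])
    while queue:
        node, parent = queue.popleft()
        if node.cell is not None:
            continue  # leaves are grid cells, not part of the upper-level index
        p = packet_of.get(id(parent)) if parent is not None else None
        if p is None or load[p] >= capacity:
            load.append(0)
            p = len(load) - 1
        load[p] += 1
        packet_of[id(node)] = p
        queue.append((node.left, node))
        queue.append((node.right, node))
    return packet_of, load


if __name__ == "__main__":
    left = Node(axis=1, disc=5, left=Node(cell="G1"), right=Node(cell="G2"))
    right = Node(axis=1, disc=3, left=Node(cell="G3"), right=Node(cell="G4"))
    root = Node(axis=0, disc=5, left=left, right=right)
    print(locate_cell(root, (2, 7)), page_top_down(root, capacity=2)[1])
```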



4.4 Discussion 

For NN search, the VD changes when objects are inserted, deleted, or relocated.
Thus, the index needs to be updated accordingly. Since updates are expected to
happen infrequently for NN search in mobile LBS applications (such as
finding the nearest restaurant or the nearest hotel), we only briefly discuss the
update issue here.

When an object $o_i$ is inserted or deleted, the VCs around $o_i$ will be affected.
The number of affected VCs is approximately the number of edges of the VC
for $o_i$, which is normally very small. For example, in Figure 2(a) adding a new
object in $P_1$ changes the VCs for $o_1$, $o_2$, and $o_3$. Assuming the server
maintains adequate information about the VCs, the affected VCs can be detected
easily.

For all the proposed grid partitions, we can identify the grid cells that over- 
lap with the affected VCs and update the index for the objects associated with 
each of them. When updates are rare, the partition of the grid cells is not mod- 
ified. On the other hand, partial or complete grid re-partition can be performed 
periodically. 



5 Performance Evaluation 

To evaluate the proposed grid-partition index, we compare it with D-tree [15]
and R-tree, which represent the solution-based index and the object-based index
for NN search, respectively, in terms of tuning time, power consumption, and
access latency. Two datasets (denoted as UNIFORM and REAL) are used in the
evaluation (see Figure 9). In the UNIFORM dataset, 10,000 points are uniformly
generated in a square Euclidean space. The REAL dataset contains 1102 parks in
the Southern California area, extracted from the point dataset available
from [6].

Since the data objects are available a priori, the STR packing scheme is
employed to build the R-tree [10]. As discussed in Section 2.2, the original
branch-and-bound NN search algorithm results in poor access latency in wireless
broadcast systems. In order to cater for the linear-access requirement on air, we
revise it as follows. The R-tree is broadcast in breadth-first order. For query
processing, no matter where the query point is located, the MBRs are accessed
sequentially, while impossible branches are pruned as in the original
algorithm [13].

The system model in the simulation consists of a base station, a number of
clients, and a broadcast channel. The available bandwidth is set to 100 Kbps.
The packet size is varied from 64 bytes to 2048 bytes. In each packet, two bytes
are allocated for the packet id. Two bytes are used for one pointer and four bytes
are used for one coordinate. The size of a data object is set to 1K bytes. The
results presented in the following sections are the average performance of
10,000,000 random queries.

Fig. 9. Datasets for Performance Evaluation: (a) UNIFORM, (b) REAL

5.1 Sensitivity to Indexing Efficiency 

Indexing efficiency has been used in the FP and SAP grid partition schemes
as guidance for determining the best cell partition. The control parameter
$\alpha$ of indexing efficiency, set to a non-negative number, weighs the
importance of the saved packet accesses and the index overhead. We conduct
experiments to test the sensitivity of tuning time and index size to $\alpha$.

Figure 10 shows the performance of the grid-partition index for the UNIFORM
dataset when the FP partition scheme is employed. Similar results are obtained
for the REAL dataset and/or other grid partitions and, thus, are omitted due to
space limitations. From the figure, we can observe that the value of $\alpha$ has
a significant impact on the performance, especially for small packet capacities.
In general, the larger the value of $\alpha$, the better the tuning time and the
worse the index storage cost, since a larger $\alpha$ value assigns more weight to
reducing tuning time. As expected, the best index overhead is achieved when
$\alpha$ is set to 0, and the best tuning time is achieved when $\alpha$ is set to
infinity. The setting of $\alpha$ can be adjusted based on the requirements of the
applications. The index overhead for air indexes is also critical as it directly
affects the access latency. Thus, for the rest of the experiments, the value of
$\alpha$ is set to 1, giving equal weight to the index size and the tuning time.





Fig. 10. Performance under Different $\alpha$ Settings (UNIFORM, FP): (a) Index Size, (b) Tuning Time



5.2 Tuning Time 

This subsection compares the different indexes in terms of tuning time. In the
wireless data broadcast environment, improving the tuning time generally saves
power. Figures 11(a) and (b) show the tuning time performance of the
compared indexes under the UNIFORM and REAL datasets, respectively.

Fig. 11. Tuning Time vs. Packet Size: (a) UNIFORM, (b) REAL



Two observations can be made. First, the proposed grid-partition indexes
outperform both D-tree and R-tree in most cases. As an example, consider
the case when the packet size is 512 bytes. For D-tree, the tuning time is
0.59 ms and 0.43 ms for the UNIFORM and REAL datasets, respectively. For
R-tree, it is 3.12 ms and 0.43 ms, respectively. The grid-partition indexes have
the best performance, i.e., no larger than 0.27 ms for the UNIFORM dataset and no
larger than 0.19 ms for the REAL dataset.

Second, among the three proposed grid partition schemes, the SAP has the 
best overall performance and is the most stable one. The main reason is that the 
SAP scheme is more adaptive to the distribution of the objects and their VCs 
than the FP scheme, while its upper-level index (i.e., the index used for locating 
grid cells) is a simpler and more efficient data structure than that of the AP 
scheme. As a result, in most cases the SAP accesses only one or two packets to 
locate grid cells and another one to detect the nearest neighbor. 

We notice that the grid-partition indexes with FP and SAP perform worse than
D-tree when the packet size is 64 bytes. This is caused by the small capacity of
the packet, which can hold information for only a few objects. Hence, the small
packet size results in a large number of grid cells and causes duplications.




Fig. 12. Variance of Tuning Time (UNIFORM); x-axis: Packet Size (byte)

Fig. 13. Performance vs. Size of Datasets

We also measure the performance stability of the compared indexes. Figure 12
shows the variance of their tuning time for the UNIFORM dataset. It
can be observed again that the grid-partition indexes outperform both D-tree
and R-tree in nearly all the cases. This means the power consumption of query
processing based on the grid-partition index is more predictable than that based
on the other indexes. This property is important for power management of mobile
devices.

In order to evaluate the scalability of the compared indexes to the number
of data objects, we measure the tuning time of the indexes by fixing the packet
size to 256 bytes and varying the number of objects from 1,000 to 50,000 (all
uniformly distributed). As shown in Figure 13, the larger the population of
objects, the worse the performance, as expected. The performance rankings among
the different indexes under various numbers of data objects are consistent.
However, it is interesting to note that the performance degradation of the
grid-partition indexes is much more graceful than that of D-tree and R-tree as
the number of data objects increases. This indicates that the advantage of the
proposed grid-partition indexes is more pronounced for large databases.



5.3 Power Consumption 

According to [8], a device equipped with the Hobbit chip (AT&T) consumes
around 250 mW of power in the active mode and 50 μW in the
doze mode. Hence, the period of time a mobile device stays in doze mode
during query processing also has an impact on the power consumption. To have
a more precise comparison of the power consumption based on various indexes,
we calculate the power consumption of a mobile device based on the periods of
active and doze modes obtained from our experiments. For simplicity, we neglect
other components that consume power during query processing and assume that
250 mW constitutes the total power consumption. Figure 14 shows the power
consumption of a mobile device under different air indexes, calculated based on
the formula: $P = 250 \times Time_{active} + 0.05 \times Time_{doze}$.
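
The formula can be evaluated directly, as in the minimal Python sketch below.
The function name is hypothetical, the default constants are the 250 mW and
0.05 mW (50 μW) figures quoted above, and the time unit is left to the caller,
so the result is expressed in mW multiplied by that unit.

```python
def consumption(time_active, time_doze, p_active_mw=250.0, p_doze_mw=0.05):
    """Weighted sum P = 250 * Time_active + 0.05 * Time_doze."""
    return p_active_mw * time_active + p_doze_mw * time_doze


if __name__ == "__main__":
    # Hypothetical query: 2 time units active, 100 time units dozing.
    print(consumption(2.0, 100.0))
```

Because the active-mode coefficient is 5,000 times the doze-mode one, the doze
period only matters when it is several orders of magnitude longer than the
active period, which is why a large index (and hence a long wait) can still
show up in the totals.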
As shown in the figure, the grid-partition indexes significantly outperform
the other indexes. For the UNIFORM dataset, the average power consumption of
D-tree is 0.23 mW, and that of R-tree is 0.69 mW. The power consumptions of the
grid-partition indexes are 0.17 mW, 0.14 mW, and 0.14 mW for FP, SAP, and
AP, respectively. For the REAL dataset, the improvement is also dramatic. The
power consumptions of D-tree and R-tree are 0.15 mW and 0.14 mW, while the
grid-partition indexes consume 0.09 mW, 0.08 mW, and 0.06 mW. Although
D-tree provides better tuning time performance when the packet size is 64 bytes,
that does not translate into lower power consumption than the grid-partition
index. This is caused by the large index overhead of D-tree, compared with that
of the grid-partition indexes (see Figure 15).

In summary, the grid-partition indexes reduce power consumption through
efficient search and small index overhead. Hence, they meet
the design requirement of energy efficiency and are well
suited to wireless broadcast environments, in which the population of users
is expected to be huge while the resources of mobile devices are very limited.



5.4 Access Latency 

The access latency is affected by the storage cost of the index and the interleaving 
algorithm to organize data and index. Since index organization is beyond the 
scope of this paper, we count the access latency using the well-known (1, m) 
scheme to interleave the index with data [8], as explained in Section 2.1. Figure 15 
shows the access latency for all the index methods. In the figures, the latency 
is normalized to the expected access latency without any index (i.e., half of the 
time needed to broadcast the database). 
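
To illustrate how index size drives the normalized latency, the following sketch uses the commonly cited analysis of the (1, m) scheme from the broadcast-indexing literature: a client first waits for the next index segment, then for its data. Section 2.1 is not reproduced here, so the formula and all names below are assumptions rather than the authors' exact model.

```python
import math


def normalized_latency(index_size: float, data_size: float, m: int) -> float:
    """Expected access latency under (1, m), relative to a broadcast with no index."""
    # Wait until the next of the m index copies starts.
    probe_wait = 0.5 * (index_size + data_size / m)
    # Then wait, on average, half of the full cycle (m index copies plus data).
    bcast_wait = 0.5 * (m * index_size + data_size)
    no_index = 0.5 * data_size  # half the time needed to broadcast the database
    return (probe_wait + bcast_wait) / no_index


def best_m(index_size: float, data_size: float) -> int:
    """Replication factor that minimizes the latency above: sqrt(data / index)."""
    return max(1, round(math.sqrt(data_size / index_size)))
```

Under these assumptions, an index that is 1% of the data size with m = 10 gives a normalized latency of about 1.21, which is in line with the "within 30%" overhead reported for small indexes.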

We can see that the D-tree has the worst performance because of its large
index size. The performance of the proposed grid-partition indexes is similar
to that of the R-tree. They introduce only a small latency overhead (within 30%
in most cases) due to their small index sizes.

When different grid partition schemes are compared, the FP performs the 
best for a small packet capacity (< 256 bytes), whereas the SAP and the AP 
perform better for a large packet capacity (> 256 bytes). This can be explained 
as follows. When the packet capacity is small, the number of grid cells is large
since all three schemes try to store the objects within a grid cell in one packet.
Thus, the index size is dominated by the overhead for storing the grid
partition information (i.e., the upper-level index). As this overhead in the FP is 
the least (four parameters plus a pointer array), it achieves the smallest overall 
index size. However, with increasing packet capacity, the overhead for storing the 
upper-level index becomes insignificant. Moreover, with a large packet capacity 
the FP has a poorer packet occupancy than the other two. This is particularly 
true for the REAL dataset, where the objects are highly clustered. As a result, 
the index overhead of the FP becomes worse. 

Fig. 15. Access Latency vs. Packet Capacity



6 Conclusion 

Nearest-neighbor search is a very important and practical application in the 
emerging mobile computing era. In this paper, we analyze the problems associ- 
ated with using object-based and solution-based indexes in wireless broadcast 
environments, where only linear access is allowed, and enhance the classical
R-tree to make it suitable for the broadcast medium. We further propose
the grid-partition index, a new energy-conserving air index for nearest neighbor 
search that combines the strengths of both the object-based and solution-based 
indexes. By studying the grid-partition index, we identify an interesting and 
fundamental research issue, i.e., grid partition, which affects the performance of 
the index. Three grid partition schemes, namely, fixed partition, semi-adaptive 
partition, and adaptive partition, are proposed in this study. 

The performance of the grid-partition index (with three grid partition
schemes) is compared with an enhanced object-based index (i.e., R-tree) and
a solution-based index (i.e., D-tree) using both synthetic and real datasets. The
results show that overall the grid-partition index substantially outperforms both
the R-tree and the D-tree. As the grid-partition index with SAP achieves the best overall
performance under the workload settings tested, it is recommended for practical use.

Although the grid-partition index is proposed to efficiently solve NN search, 
it can also serve other queries, such as window queries and continuous nearest 
neighbor search. As for future work, we plan to extend the idea of the grid- 
partition index to answer multiple kinds of queries, including k-NN queries. As 
a first step, this paper only briefly addresses the update issue in a general discus- 
sion. We are investigating efficient algorithms to support updates. In addition, 
we are examining generalized NN search such as “show me the nearest hotel 
with room rate < $200”. 

Abstract. Location monitoring is an important issue for real time management
of mobile object positions. Significant research efforts have been dedicated to 
techniques for efficient processing of spatial continuous queries on moving ob- 
jects in a centralized location monitoring system. Surprisingly, very few have 
promoted a distributed approach to real-time location monitoring. In this paper 
we present a distributed and scalable solution to processing continuously moving 
queries on moving objects and describe the design of MobiEyes, a distributed 
real-time location monitoring system in a mobile environment. MobiEyes utilizes
the computational power at mobile objects, leading to significant savings in terms 
of server load and messaging cost when compared to solutions relying on central 
processing of location information at the server. We introduce a set of optimization 
techniques, such as Lazy Query Propagation, Query Grouping, and Safe Periods, 
to limit the amount of computation handled by the moving objects and to
enhance the performance and system utilization of MobiEyes. We also provide
a simulation model in a mobile setup to study the scalability of the MobiEyes 
distributed location monitoring approach with regard to server load, messaging 
cost, and amount of computation required on the mobile objects. 



1 Introduction 

With the growing market of positioning technologies like GPS [1] and the growing popu- 
larity and availability of mobile communications, location information management has 
become an important problem [17,10,14,5,15,13,16,9,2] in mobile computing systems. 
With the continued growth of computational capabilities in mobile devices, ranging from
navigational systems in cars to hand-held devices and cell phones, mobile devices are
becoming increasingly accessible. We expect that future mobile applications will require
a scalable architecture that is capable of handling a large and rapidly growing number of
mobile objects and processing complex queries over mobile object positions.

Location monitoring is an important issue for real time querying and management of 
mobile object positions. Significant research efforts have been dedicated to techniques 
for efficient processing of spatial continuous queries on moving objects in a centralized 
location monitoring system. Surprisingly, very few have promoted a distributed approach 
to real-time location monitoring over a large and growing number of mobile objects. 

In this paper we present a distributed approach to real-time location monitoring 
over a large and growing number of mobile objects. Concretely, we describe the design 
of MobiEyes, a distributed real-time location monitoring system for processing moving 
queries over moving objects in a mobile environment. Before we describe the motivation
of the MobiEyes system and the main contributions of the paper, we first give a brief 
overview of the concept of moving queries. 

1.1 Moving Queries over Moving Objects 

A moving query over moving objects (MQ for short) is a spatial continuous moving 
query over locations of moving objects. An MQ defines a spatial region bound to a 
specific moving object and a filter which is a boolean predicate on object properties. The 
result of an MQ consists of objects that are inside the area covered by the query’s spatial 
region and satisfy the query filter. 

MQs are continuous queries [11] in the sense that the results of queries continuously
change as time progresses. We refer to the object to which an MQ is bound as the focal
object of that query. The objects that are subject to being included in a query’s result
are called the target objects of the MQ. Note that the spatial region of an MQ also moves
as the focal object of the MQ moves. There are many examples of moving queries over
moving objects in real life. For instance, the query MQ1: “Give me the number of
friendly units within 5 miles radius around me during next 2 hours” can be submitted
by a soldier equipped with mobile devices marching in the field, or a moving tank in a
military setting. The query MQ2: “Give me the positions of those customers who are
looking for taxi and are within 5 miles (of my location at each instance of time or at
an interval of every minute) during the next 20 minutes” can be posted by a taxi driver
driving on the road. The focal object of MQ1 is the soldier marching in the field or the
moving tank. The focal object of MQ2 is the taxi driver on the road.

1.2 MobiEyes: Distributed Processing of MQs 

Most of the existing approaches for processing spatial queries on moving objects are not 
scalable, due to their inherent assumption that location monitoring and communications 
of mobile objects are controlled by a central server. Namely, mobile objects report their 
position changes to the server whenever their position information changes, and the 
server determines which moving objects should be included in which moving queries 
at each instance of time or at a given time interval. For mobile applications that need 
to handle a large and growing number of moving objects, the centralized approaches 
can suffer from dramatic performance degradation in terms of server load and network 
bandwidth. 

In this paper we present MobiEyes, a distributed solution for processing MQs in a 
mobile setup. Our solution ships some part of the query processing down to the moving 
objects, and the server mainly acts as a mediator between moving objects. This signifi- 
cantly reduces the load on the server side and also results in savings on the communication 
between moving objects and the server. 

This paper has three unique contributions. First, we present a careful design of 
the distributed solution to real-time evaluation of continuously moving queries over 
moving objects. One of the main design principles is to develop efficient mechanisms 
that utilize the computational power at mobile objects, leading to significant savings 
in terms of server load and messaging cost when compared to solutions relying on 
central processing of location information at the server. Second, we develop a number
of optimization techniques, including Lazy Query Propagation and Query Grouping.
We use Query Grouping to limit the amount of computation to be
performed by the moving objects in situations where a moving object is moving in an
area that has many queries. We use Lazy Query Propagation to allow trade-offs between
query precision on the one hand and network bandwidth cost and energy consumption on the moving
objects on the other. Third, we provide a simulation model in a mobile setup to study
the scalability of the MobiEyes distributed location monitoring approach with regard to 
server load, messaging cost, and amount of computation required on the mobile objects. 

2 System Model 

2.1 System Assumptions 

Below we summarize four underlying assumptions used in the design of the MobiEyes
system. All of these assumptions are either widely accepted or have been seen
as common practice in most existing mobile systems in the context of monitoring and
tracking of moving objects.

— Moving objects are able to locate their positions: We assume that each moving object 
is equipped with a technology like GPS [1] to locate its position. This is a reasonable 
assumption as GPS devices are becoming inexpensive and are used widely in cars and 
other hand-held devices to provide navigational support. 

— Moving objects have synchronized clocks: Again this assumption can be met if the 
moving objects are equipped with GPS. Another solution is to make NTP [12] (network
time protocol) available to moving objects through base stations. 

— Moving objects are able to determine their velocity vector: This assumption is easily 
met when the moving object is able to determine its location and has an internal timer. 

— Moving objects have computational capabilities to carry out computational tasks: 
This assumption represents a fast-growing trend in mobile and wireless technology. The
number of mobile devices equipped with computational power is escalating rapidly; even
simple sensors [6] today are equipped with computational capabilities.
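
The third assumption above (a velocity vector can be derived from positions and an internal timer) can be sketched with a simple finite difference. This is an illustration under that assumption, with a hypothetical function name, not a description of how any particular device computes velocity.

```python
from typing import Tuple

Vec = Tuple[float, float]


def velocity_vector(prev_pos: Vec, prev_t: float, curr_pos: Vec, curr_t: float) -> Vec:
    """Estimate (velx, vely) from two timestamped position fixes."""
    dt = curr_t - prev_t
    if dt <= 0:
        raise ValueError("the second fix must be later than the first")
    return ((curr_pos[0] - prev_pos[0]) / dt, (curr_pos[1] - prev_pos[1]) / dt)
```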

2.2 The Moving Object Model 

The MobiEyes system assumes that the geographical area of interest is covered by several
base stations, which are connected to a central server. A three-tier architecture (mobile
objects, base stations, and the server) is used in the subsequent discussions. We can easily
extend the three tiers to a multi-tier communication hierarchy between the mobile objects
and the server. In addition, asymmetric communication is used to establish connections
from the server to the moving objects. Concretely, a base station can communicate
directly with the moving objects in its coverage area through a broadcast, and the moving
objects can only communicate with the base station if they are located in the coverage
area of this base station.

Let O be the set of moving objects. Formally, we can describe a moving object $o \in O$
by a quadruple (oid, pos, vel, {props}). oid is the unique object identifier, and pos is the
current position of the object o. vel = (velx, vely) is the current velocity vector of
the object, where velx is its velocity in the x-dimension and vely is its velocity in the
y-dimension. {props} is a set of properties about the moving object o, including spatial,
temporal, or object-specific properties, such as the color or manufacturer of a mobile unit (or
even the application-specific attributes registered on the mobile unit by the user).

The basic notations used in the subsequent sections of the paper are formally defined 
below: 

— Rectangle shaped region and circle shaped region : A rectangle shaped region is 
defined by $Rect(lx, ly, w, h) = \{(x, y) : x \in [lx, lx + w] \wedge y \in [ly, ly + h]\}$ and a circle
shaped region is defined by $Circle(cx, cy, r) = \{(x, y) : (x - cx)^2 + (y - cy)^2 \leq r^2\}$.

— Universe of Discourse (UoD): We refer to the geographical area of interest as the 
universe of discourse, which is defined by $U = Rect(X, Y, W, H)$. X, Y, W, and H are
system-level parameters to be set at system initialization time.

— Grid and Grid cells: In MobiEyes, we map the universe of discourse,
$U = Rect(X, Y, W, H)$, onto a grid $G$ of cells, where each grid cell is an $\alpha \times \alpha$ square
area, and $\alpha$ is a system parameter that defines the cell size of the grid $G$. Formally, a grid
corresponding to the universe of discourse $U$ can be defined as
$G(U, \alpha) = \{A_{i,j} : 1 \leq i \leq M, 1 \leq j \leq N, A_{i,j} = Rect(X + i \cdot \alpha, Y + j \cdot \alpha, \alpha, \alpha), M = \lceil H/\alpha \rceil, N = \lceil W/\alpha \rceil\}$.
$A_{i,j}$ is an $\alpha \times \alpha$ square area representing the grid cell that is located on the $i$th row and
$j$th column of the grid $G$.

— Position to Grid Cell Mapping: Let $pos = (x, y)$ be the position of a moving object
in the universe of discourse $U = Rect(X, Y, W, H)$. Let $A_{i,j}$ denote a cell in the grid
$G(U, \alpha)$. $Pmap(pos)$ is a position to grid cell mapping, defined as
$Pmap(pos) = A_{\lceil (pos.x - X)/\alpha \rceil, \lceil (pos.y - Y)/\alpha \rceil}$.

— Current Grid Cell of an Object: Current grid cell of a moving object is the grid cell 
which contains the current position of the moving object. If $o \in O$ is an object whose
current position, denoted as o.pos, is in the universe of discourse U, then the current
grid cell of the object is formally defined by $curr\_cell(o) = Pmap(o.pos)$.

— Base Stations: Let $U = Rect(X, Y, W, H)$ be the universe of discourse and $B$ be the
set of base stations overlapping with $U$. Assume that each base station $b \in B$ is defined
by a circle region $Circle(bsx, bsy, bsr)$. We say that the set $B$ of base stations covers
the universe of discourse $U$, i.e., $(\bigcup_{b \in B} b) \supseteq U$.

— Grid Cell to Base Station Mapping: Let $Bmap : \mathbb{N} \times \mathbb{N} \to 2^B$ define a mapping
that maps a grid cell index to a non-empty set of base stations. We define
$Bmap(i, j) = \{b : b \in B \wedge b \cap A_{i,j} \neq \emptyset\}$. $Bmap(i, j)$ is the set of base stations that cover the grid
cell $A_{i,j}$.
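
The notations above can be sketched in Python as follows. This is an illustrative sketch only: all names are hypothetical, and it uses zero-based cell indices computed with floor division, a convention chosen here for simplicity rather than the paper's exact indexing.

```python
import math
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)
class Rect:
    lx: float
    ly: float
    w: float
    h: float


@dataclass(frozen=True)
class Circle:
    cx: float
    cy: float
    r: float

    def contains(self, x: float, y: float) -> bool:
        return (x - self.cx) ** 2 + (y - self.cy) ** 2 <= self.r ** 2

    def intersects(self, rect: Rect) -> bool:
        # Test the point of the rectangle that is closest to the circle centre.
        nx = min(max(self.cx, rect.lx), rect.lx + rect.w)
        ny = min(max(self.cy, rect.ly), rect.ly + rect.h)
        return self.contains(nx, ny)


def grid_dims(u: Rect, alpha: float) -> Tuple[int, int]:
    """Number of cells along x and along y: ceil(W / alpha), ceil(H / alpha)."""
    return math.ceil(u.w / alpha), math.ceil(u.h / alpha)


def cell_rect(u: Rect, alpha: float, i: int, j: int) -> Rect:
    """Square cell (i, j), zero-based, with i along x and j along y (assumption)."""
    return Rect(u.lx + i * alpha, u.ly + j * alpha, alpha, alpha)


def pmap(u: Rect, alpha: float, x: float, y: float) -> Tuple[int, int]:
    """Position to grid cell mapping; points on the far edge go to the last cell."""
    nx, ny = grid_dims(u, alpha)
    i = min(int((x - u.lx) // alpha), nx - 1)
    j = min(int((y - u.ly) // alpha), ny - 1)
    return i, j


def bmap(u: Rect, alpha: float, stations: List[Circle], i: int, j: int) -> Set[int]:
    """Indices of the base stations whose coverage circle intersects cell (i, j)."""
    cell = cell_rect(u, alpha, i, j)
    return {k for k, b in enumerate(stations) if b.intersects(cell)}
```

Since the grid and the base stations are static, a server could evaluate `bmap` once per cell at start-up and cache the result.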

2.3 Moving Query Model 

Let Q be the set of moving queries. Formally, we can describe a moving query $q \in Q$ by a
quadruple (qid, oid, region, filter). qid is the unique query identifier, and oid is the object
identifier of the focal object of the query. region defines the shape of the spatial query
region bound to the focal object of the query. It can be described by a closed shape
description such as a rectangle, a circle, or any other closed shape description that
has a computationally cheap point containment check. This closed shape description
also specifies a binding point, through which it is bound to the focal object of the query.
Without loss of generality, we use a circle, with its center serving as the binding point, to
represent the shape of the region of a moving query in the rest of the paper. filter is a
Boolean predicate defined over the properties {props} of the target objects of a moving
query q. For presentation convenience, in the rest of the paper we consider the result of
an MQ to be the set of object identifiers of the moving objects that are located within the area
covered by the spatial region of the query and satisfy the filter condition.
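
As a minimal sketch of this query model, the snippet below evaluates a circle-shaped MQ centrally over a snapshot of object positions. The class and function names are hypothetical, and excluding the focal object from its own result is an assumption made here, since the text does not settle that point.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, Dict, Set, Tuple


@dataclass
class MovingObject:
    oid: int
    pos: Tuple[float, float]
    vel: Tuple[float, float] = (0.0, 0.0)
    props: Dict[str, object] = field(default_factory=dict)


@dataclass
class MovingQuery:
    qid: int
    oid: int                                   # identifier of the focal object
    radius: float                              # circle region bound at its center
    filter: Callable[[Dict[str, object]], bool]


def evaluate_query(q: MovingQuery, objects: Dict[int, MovingObject]) -> Set[int]:
    """Object identifiers inside the query circle that satisfy the filter."""
    fx, fy = objects[q.oid].pos
    result = set()
    for o in objects.values():
        if o.oid == q.oid:
            continue  # assumption: the focal object is not its own target
        inside = math.hypot(o.pos[0] - fx, o.pos[1] - fy) <= q.radius
        if inside and q.filter(o.props):
            result.add(o.oid)
    return result
```

This brute-force evaluation is the centralized baseline; the rest of the paper distributes exactly this containment check to the moving objects themselves.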

A formal definition of basic notations regarding MQs is given below. 

— Bounding Box of a Moving Query: Let $q \in Q$ be a query with focal object $fo \in O$ and
spatial region region, and let rc denote the current grid cell of fo, i.e., $rc = curr\_cell(fo)$.
Let lx and ly denote the x-coordinate and the y-coordinate of the lower left corner
point of the current grid cell rc. The bounding box of a query q is a rectangle shaped
region, which covers all possible areas that the spatial region of the query q may move
into when the focal object fo of the query travels within its current grid cell. For a circle
shaped spatial query region with radius r, the bounding box can be formally defined as
$bound\_box(q) = Rect(rc.lx - r, rc.ly - r, \alpha + 2r, \alpha + 2r)$.

— Monitoring Region of a Moving Query: The grid region defined by the union of
all grid cells that intersect with the bounding box of a query forms the monitoring
region of the query. It is formally defined as $mon\_region(q) = \bigcup_{(i,j) \in S} A_{i,j}$, where
$S = \{(i, j) : A_{i,j} \cap bound\_box(q) \neq \emptyset\}$. The monitoring region of a moving query
covers all the objects that are subject to being included in the result of the moving query
when the focal object stays in its current grid cell.

— Nearby Queries of an Object: Given a moving object o, we refer to all MQs whose 
monitoring regions intersect with the current grid cell of the moving object o as the nearby
queries of the object o, i.e.,
$nearby\_queries(o) = \{q : mon\_region(q) \cap curr\_cell(o) \neq \emptyset \wedge q \in Q\}$.
Every mobile object is either a target object of or is of potential interest to
its nearby MQs.
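
The three definitions above can be sketched as follows. The names, the zero-based `(i, j)` cell indices, and the representation of a monitoring region as a set of cell indices are assumptions made for illustration.

```python
import math
from typing import Dict, Set, Tuple

Cell = Tuple[int, int]
Origin = Tuple[float, float]


def bound_box(cell: Cell, alpha: float, r: float,
              origin: Origin = (0.0, 0.0)) -> Tuple[float, float, float, float]:
    """Rect(rc.lx - r, rc.ly - r, alpha + 2r, alpha + 2r) as an (lx, ly, w, h) tuple."""
    lx = origin[0] + cell[0] * alpha
    ly = origin[1] + cell[1] * alpha
    return (lx - r, ly - r, alpha + 2 * r, alpha + 2 * r)


def mon_region(cell: Cell, alpha: float, r: float, nx: int, ny: int,
               origin: Origin = (0.0, 0.0)) -> Set[Cell]:
    """All grid cells (clipped to an nx-by-ny grid) that intersect the bounding box."""
    bx, by, bw, bh = bound_box(cell, alpha, r, origin)
    i_lo = max(0, math.floor((bx - origin[0]) / alpha))
    i_hi = min(nx - 1, math.floor((bx + bw - origin[0]) / alpha))
    j_lo = max(0, math.floor((by - origin[1]) / alpha))
    j_hi = min(ny - 1, math.floor((by + bh - origin[1]) / alpha))
    return {(i, j) for i in range(i_lo, i_hi + 1) for j in range(j_lo, j_hi + 1)}


def nearby_queries(cell: Cell, regions: Dict[int, Set[Cell]]) -> Set[int]:
    """Identifiers of the queries whose monitoring region includes the given cell."""
    return {qid for qid, region in regions.items() if cell in region}
```

Because the bounding box depends only on the focal object's current cell and the query radius, the monitoring region stays fixed until the focal object crosses a cell boundary.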

3 Distributed Processing of Moving Queries 

In this section we give an overview of our distributed approach to efficient processing 
of MQs, and then focus on the main building blocks of our solution and the important 
algorithms used. A comparison of our work with the related research in this area is 
provided in Section 6. 

3.1 Algorithm Overview 

In MobiEyes, distributed processing of moving queries consists of server side processing 
and mobile object side processing. The main idea is to provide mechanisms such that 
each mobile object can determine by itself whether or not it should be included in the 
result of a moving query close by, without requiring global knowledge regarding the 
moving queries and the object positions. A brief review is given below on the main 
components and key ideas used in MobiEyes for distributed processing of MQs. We will 
provide detailed technical discussion on each of these ideas in the subsequent sections. 

Server Side Processing: The server side processing can be characterized as mediation
between moving objects. It performs two main tasks. First, it keeps track of the significant 
position changes of all focal objects, namely the change in velocity vector and the 
position change that causes the focal object to move out of its current grid cell. Second, 
it broadcasts the significant position changes of the focal objects and the addition or 
deletion of the moving queries to the appropriate subset of moving objects in the system. 

Monitoring Region of an MQ: To enable efficient processing at the mobile object side, we
introduce the monitoring region of a moving query to identify all moving objects that 
may get included in the query’s result when the focal object of the query moves within 
its current cell. The main idea is to have those moving objects that reside in a moving 
query’s monitoring region to be aware of the query and to be responsible for calculating 
if they should be included in the query result. Thus, the moving objects that are not in 
the neighborhood of a moving query do not need to be aware of the existence of the 
moving query, and the query result can be efficiently maintained by the objects in the 
query’s monitoring region. 

Registering MQs at Moving Object Side: In MobiEyes, the task of making sure that the 
moving objects in a query’s monitoring region are aware of the query is accomplished 
through server broadcasts, which are triggered by either installations of new moving 
queries or notifications of changes in the monitoring regions of existing moving queries 
when their focal objects change their current grid cells. Upon receiving a broadcast 
message, for each MQ in the message, the mobile objects examine their local state and 
determine whether they should be responsible for processing this moving query. This 
decision is based on whether the mobile objects themselves are within the monitoring 
region of the query. 

Moving Object Side Processing: Once a moving query is registered at the moving 
object side, the moving object will be responsible for periodically tracking if it is within 
the spatial region of the query, by predicting the position of the focal object of the query. 
Changes in the containment status of the moving object with respect to moving queries 
are differentially relayed to the server. 
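
The periodic check on the moving object side can be sketched as below: the object extrapolates the focal object's position from the last relayed position and velocity, tests containment, and reports only when its status flips. The dictionary layout and function names are hypothetical.

```python
import math
from typing import Optional, Tuple

Vec = Tuple[float, float]


def predict_position(pos: Vec, vel: Vec, tm: float, now: float) -> Vec:
    """Linear extrapolation of the focal object's position from its last report."""
    dt = now - tm
    return (pos[0] + vel[0] * dt, pos[1] + vel[1] * dt)


def evaluate_entry(entry: dict, own_pos: Vec, now: float) -> Optional[bool]:
    """Re-evaluate one locally registered query.

    Returns the new containment status if it changed (this is what would be
    relayed to the server), or None if nothing needs to be sent.
    """
    fx, fy = predict_position(entry["pos"], entry["vel"], entry["tm"], now)
    inside = math.hypot(own_pos[0] - fx, own_pos[1] - fy) <= entry["radius"]
    if inside != entry["is_target"]:
        entry["is_target"] = inside
        return inside
    return None
```

Sending only the changes keeps uplink traffic proportional to how often objects enter or leave query regions, not to how often they evaluate them.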

Handling Significant Position Changes: In case the position of the focal object of a 
moving query changes significantly (it moves out of its current grid cell or changes 
its velocity vector significantly), it will report to the server, and the server will relay 
such position change information to the appropriate subset of moving objects through 
broadcasts. 



3.2 Data Structures 

In this section we describe the design of the data structures used on the server side and 
on the moving object side, in order to support distributed processing of MQs. 

Server-Side Data Structures 

The server side stores four types of data structures: the focal object table FOT, the server 
side moving query table SQT , the reverse query index matrix RQI, and the static grid 
cell to base station mapping Bmap. 

Focal Object Table, FOT = (oid, pos, vel, tm), is used to store information about
moving objects that are the focal objects of MQs. The table is indexed on the oid attribute,
which is the unique object identifier. tm is the time at which the position, pos, and the
velocity vector, vel, of the focal object with identifier oid were recorded on the moving
object side. When the focal object reports its position and velocity change to the server,
it also includes this timestamp in the report.

Server-side Moving Query Table, SQT = (qid, oid, region, curr_cell,
mon_region, filter, {result}), is used to store information about all spatial queries
hosted by the system. The table is indexed on the qid attribute, which represents the
query identifier. oid is the identifier of the focal object of the query, region is the
query’s spatial region, curr_cell is the grid cell in which the focal object of the query
is located, mon_region is the monitoring region of the query, and {result} is the set of object
identifiers representing the set of target objects of the query. These objects are located
within the query’s spatial region and satisfy the query filter.

Reverse Query Index, RQI, is an M x N matrix whose cells are sets of query
identifiers. M and N denote the number of rows and the number of columns of the grid
corresponding to the universe of discourse of a MobiEyes system. RQI(i, j) stores
the identifiers of the queries whose monitoring regions intersect with the grid cell $A_{i,j}$.
RQI(i, j) represents the nearby queries of an object whose current grid cell is $A_{i,j}$, i.e.,
$\forall o \in O$, $nearby\_queries(o) = RQI(i, j)$, where $curr\_cell(o) = A_{i,j}$.
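
One possible in-memory layout for the server-side tables is sketched below, using dictionaries keyed by identifier for FOT and SQT and a matrix of sets for RQI. The class and field names are hypothetical, a circular region is assumed (so only a radius is stored), and cells are addressed with zero-based indices.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple

Cell = Tuple[int, int]
Vec = Tuple[float, float]


@dataclass
class FocalEntry:
    """One FOT row, keyed by oid."""
    pos: Vec
    vel: Vec
    tm: float


@dataclass
class QueryEntry:
    """One SQT row, keyed by qid."""
    oid: int
    radius: float
    curr_cell: Cell
    mon_region: Set[Cell]
    filter: Callable[[dict], bool]
    result: Set[int] = field(default_factory=set)


class ServerState:
    def __init__(self, m_rows: int, n_cols: int) -> None:
        self.fot: Dict[int, FocalEntry] = {}
        self.sqt: Dict[int, QueryEntry] = {}
        # RQI: an M x N matrix whose cells are sets of query identifiers.
        self.rqi: List[List[Set[int]]] = [
            [set() for _ in range(n_cols)] for _ in range(m_rows)
        ]

    def index_query(self, qid: int) -> None:
        """Record qid in every RQI cell covered by its monitoring region."""
        for i, j in self.sqt[qid].mon_region:
            self.rqi[i][j].add(qid)

    def nearby_queries(self, cell: Cell) -> Set[int]:
        """Nearby queries of any object whose current grid cell is `cell`."""
        i, j = cell
        return self.rqi[i][j]
```

With this layout, finding the queries to install on an object that enters a new cell is a single matrix lookup.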

Moving Object-Side Data Structures 

Each moving object o stores a local query table LQT and a Boolean variable hasMQ. 

Local Query Table, LQT = (qid, pos, vel, tm, region, mon_region, isTarget),
is used to store information about moving queries whose monitoring regions intersect
with the grid cell in which the moving object o is currently located. qid is the
unique query identifier assigned at the time when the query is installed at the server, pos
is the last known position, and vel is the last known velocity vector of the focal object of
the query. tm is the time at which the position and the velocity vector of the focal object
were recorded (by the focal object of the query itself, not by the object on which the LQT
resides). isTarget is a Boolean variable describing whether the object was found to be
inside the query’s spatial region at the last evaluation of this query by the moving object
o. The Boolean variable hasMQ is a flag showing whether the moving object o
storing the LQT is a focal object of some query or not.

3.3 Installing Queries 

Installation of a moving query into the MobiEyes system consists of two phases. First, the 
MQ is installed at the server side and the server state is updated to reflect the installation 
of the query. Second, the query is installed at the set of moving objects that are located 
inside the monitoring region of the query. 

Updating the Server State 

When the server receives a moving query, assuming it is in the form (oid, region,
filter), it performs the following installation actions. (1) It first checks whether the
focal object with identifier oid is already contained in the FOT table. (2) If the focal
object of the query already exists, it means that either someone else has installed the
same query earlier or there exist multiple queries with different filters but the same
focal object. Since the FOT table already contains velocity and position information
regarding the focal object of this query, the installation simply creates a new entry for
this new MQ, adds this entry to the server-side query table SQT, and then modifies the
RQI entry that corresponds to the current grid cell of the focal object to include this new
MQ in the reverse query index (detailed in step (4)). At this point the query is installed
on the server side. (3) However, if the focal object of the query is not present in the
FOT table, then the server-side installation manager needs to contact the focal object
of this new query and request the position and velocity information. Then the server
can directly insert the entry (oid, pos, vel, tm) into FOT, where tm is the timestamp
at which the object with identifier oid recorded its pos and vel information. (4) The
server then assigns a unique identifier qid to the query and calculates the current grid cell
(curr_cell) of the focal object and the monitoring region (mon_region) of the query.
A new moving query entry (qid, oid, region, curr_cell, mon_region, filter) is
created and added into the SQT table. The server also updates the RQI index by adding
this query with identifier qid to RQI(i, j) if $A_{i,j} \cap mon\_region(qid) \neq \emptyset$. At this point
the query is installed on the server side.
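
The installation steps above can be sketched as a single function. Everything here is illustrative: the tables are plain dictionaries, RQI is keyed by cell index, and `request_state`, `cell_of`, and `mon_region_of` are hypothetical callables standing in for the communication with the focal object and for the grid computations defined earlier.

```python
from typing import Callable, Dict, Set, Tuple

Cell = Tuple[int, int]
Vec = Tuple[float, float]
State = Tuple[Vec, Vec, float]  # (pos, vel, tm)


def install_query(
    fot: Dict[int, State],
    sqt: Dict[int, dict],
    rqi: Dict[Cell, Set[int]],
    oid: int,
    radius: float,
    flt: Callable[[dict], bool],
    request_state: Callable[[int], State],
    cell_of: Callable[[Vec], Cell],
    mon_region_of: Callable[[Cell, float], Set[Cell]],
) -> int:
    # Steps (1)-(3): contact the focal object only if it is not in FOT yet.
    if oid not in fot:
        fot[oid] = request_state(oid)
    pos, _vel, _tm = fot[oid]

    # Step (4): assign a qid, compute the current cell and monitoring region,
    # then add the query to SQT and to every RQI cell its region covers.
    qid = max(sqt, default=0) + 1
    curr_cell = cell_of(pos)
    region = mon_region_of(curr_cell, radius)
    sqt[qid] = {
        "oid": oid, "radius": radius, "curr_cell": curr_cell,
        "mon_region": region, "filter": flt, "result": set(),
    }
    for cell in region:
        rqi.setdefault(cell, set()).add(qid)
    return qid
```

Installing a second query on the same focal object reuses the existing FOT entry, so no extra round trip to the object is needed.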

Installing Queries on the Moving Objects 

After installing queries on the server side, the server needs to complete the installation by 
triggering query installation on the moving object side. This job is done by performing 
two tasks. First, the server sends an installation notification to the focal object with 
identifier oid, which upon receiving the notification sets its hasMQ variable to true. 
This makes sure that the moving object knows that it is now a focal object and is supposed 
to report velocity vector changes to the server. The second task is for the server to forward 
this query to all objects that reside in the query’s monitoring region, so that they can 
install the query and monitor their position changes to determine if they become the 
target objects of this query. To perform this task, the server uses the mapping Bmap to 
determine the minimal set of base stations (i.e., the smallest number of base stations) 
that covers the monitoring region. Then the query is sent to all objects that are covered 
by the base stations in this set through broadcast messages. When an object receives 
the broadcast message, it checks whether its current grid cell is covered by the query’s 
monitoring region. If so, the object installs the query into its local query table LQT 
when the query’s filter is also satisfied by the object. Otherwise the object discards the 
message. 
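
The decision a moving object makes on receiving an installation broadcast can be sketched as follows. The message layout and names are hypothetical; the logic mirrors the two checks described above (cell inside the monitoring region, filter satisfied).

```python
from typing import Dict, Tuple

Cell = Tuple[int, int]


def on_install_broadcast(curr_cell: Cell, props: dict,
                         lqt: Dict[int, dict], msg: dict) -> bool:
    """Install the broadcast query into the local query table if it applies.

    Returns True if the query was installed, False if the message was discarded.
    """
    if curr_cell not in msg["mon_region"]:
        return False  # not in the query's monitoring region
    if not msg["filter"](props):
        return False  # the object can never be a target of this query
    lqt[msg["qid"]] = {
        "pos": msg["pos"], "vel": msg["vel"], "tm": msg["tm"],
        "radius": msg["radius"], "mon_region": msg["mon_region"],
        "is_target": False,
    }
    return True
```

Applying the filter at installation time means objects that cannot satisfy it never spend cycles tracking the query.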



3.4 Handling Velocity Vector Changes 

Once a query is installed in the MobiEyes system, the focal object of the query needs 
to report to the server any significant change to its location information, including sig- 
nificant velocity changes or changes that move the focal object out of its current grid 
cell. We describe the mechanisms for handling velocity changes in this section and the 
mechanisms for handling objects that change their current grid cells in the next section. 

A velocity vector change, once identified as significant, needs to be relayed to
the objects that reside in the query’s monitoring region through the server acting as a
mediator. When the focal object of a query reports a velocity vector change, it sends
its new velocity vector, its position, and the timestamp at which this information was
recorded to the server. The server first updates the FOT table with the information
received from the focal object. Then, for each query associated with the focal object, the
server communicates the newly received information to the objects located in the monitoring
region of the query using the minimum number of broadcasts (this can be done through
the use of the grid cell to base station mapping Bmap).

A subtle point is that the velocity vector of the focal object will almost always
change at each time step in a real-world setup, although the change might be insignificant.
One way to handle this is to convey the new velocity vector information to the objects
located in the monitoring region of the query only if the change in the velocity vector is
significant. In MobiEyes, we use a variation of dead reckoning to decide what constitutes
a significant velocity vector change.

Dead Reckoning in MobiEyes 

Concretely, at each time step the focal object of a query samples its current position
and calculates the difference between its current position and the position that the other
objects believe it to be at (based on the last velocity vector information relayed). If
this difference is larger than a threshold, say Δ, the new velocity vector information is
relayed (see footnote 1).
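
The dead reckoning test can be sketched as below. This is an illustrative sketch under stated assumptions: positions are 2-D tuples, the other objects extrapolate the focal object's position linearly from the last relayed position and velocity, and the names `predicted_position`, `needs_relay`, and the parameter `delta` (the threshold) are hypothetical.

```python
import math


def predicted_position(last_pos, last_vel, t_recorded, t_now):
    """Position that other objects extrapolate from the last relayed update."""
    dt = t_now - t_recorded
    return (last_pos[0] + last_vel[0] * dt, last_pos[1] + last_vel[1] * dt)


def needs_relay(actual_pos, last_pos, last_vel, t_recorded, t_now, delta):
    """True if the deviation from the believed position exceeds the threshold."""
    px, py = predicted_position(last_pos, last_vel, t_recorded, t_now)
    deviation = math.hypot(actual_pos[0] - px, actual_pos[1] - py)
    return deviation > delta
```

A focal object that keeps following its last relayed velocity vector never triggers a relay, however many time steps pass; only accumulated drift beyond the threshold does.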

3.5 Handling Objects That Change Their Grid Cells 

In a mobile system the fact that a moving object changes its current grid cell has an
impact on the set of queries the object is responsible for monitoring. If the object
that has changed its current grid cell is a focal object, the change also has an impact
on the set of objects that have to monitor the queries bound to this focal object. In
this section we describe how the MobiEyes system can effectively adapt to such changes
and the mechanisms used for handling them.

When an object changes its current grid cell, it notifies the server of this change 
by sending its object identifier, its previous grid cell and its new current grid cell to 
the server. The object also removes those queries whose monitoring regions no longer 
cover its new current grid cell from its local query table LQT. Upon receipt of the 
notification, the server performs two sets of operations depending on whether the object 
is a focal object of some query or not. If the object is a non-focal object, the only thing 
that the server needs to do is to find what new queries should be installed on this object 
and then perform the query installation on this moving object. This step is performed 
because the new current grid cell that the object has moved into may intersect with the 
monitoring regions of a different set of queries than its previous set. The server uses 
the reverse query index RQI together with the previous and the new current grid cell 
of the object to determine the set of new queries that has to be installed on this moving 
object. Then the server sends the set of new queries to the moving object for installation.
The focal object table FOT and the server query table SQT are used to create the required
installation information for the queries to be installed on the object. However, if the
object that changes its current grid cell is a focal object of some query, an additional set of
operations is performed. For each query with this object as its focal object, the server
performs the following operations. It updates the query’s SQT table entry by resetting the
current grid cell and the monitoring region to their new values. It also updates the RQI
index to reflect the change. Then the server computes the union of the query’s previous
monitoring region and its new monitoring region, and sends a broadcast message to all
objects that reside in this combined area. This message includes information about the
new state of the query. Upon receipt of this message from the server, an object performs
the following operations for installing/removing a query. It checks whether its current
grid cell is covered by the query’s monitoring region. If not, the object removes the query
from its LQT table (if the entry already exists), since the object’s position is no longer
covered by the query’s monitoring region. Otherwise, it installs the query if the query is
not already installed and the query filter is satisfied, by adding a new query entry to the
LQT table. If the query is already installed in LQT, it updates the monitoring
region of the query’s entry in LQT.

1 We do not consider the inaccuracy introduced by the motion modeling.
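
The grid cell change handling described in this section can be sketched as follows. This is a sketch under stated assumptions and not the system's actual code: RQI is assumed to map each grid cell to the set of query identifiers whose monitoring regions cover it, monitoring regions are modeled as sets of grid cells, and the message layout and all function names are hypothetical.

```python
def new_queries_for_cell_change(rqi, prev_cell, new_cell):
    """Server side, non-focal object: queries covering the new cell only."""
    return rqi.get(new_cell, set()) - rqi.get(prev_cell, set())


def broadcast_area(prev_region, new_region):
    """Server side, focal object: union of old and new monitoring regions."""
    return set(prev_region) | set(new_region)


def on_query_update(lqt, my_cell, my_attrs, msg):
    """Object side: install, update, or remove a query after a broadcast.

    lqt is the object's local query table (qid -> entry); msg carries the
    query id, its new monitoring region, its filter, and its new state.
    """
    qid = msg["qid"]
    if my_cell not in msg["mon_region"]:
        # No longer covered: drop the entry if it exists.
        lqt.pop(qid, None)
        return "not_covered"
    if qid in lqt:
        # Already installed: only the monitoring region is refreshed.
        lqt[qid]["mon_region"] = msg["mon_region"]
        return "updated"
    if msg["filter"](my_attrs):
        lqt[qid] = {"mon_region": msg["mon_region"], "state": msg["state"]}
        return "installed"
    return "ignored"
```

Broadcasting to the union of the two regions is what lets objects in the old region learn that they should drop the query, while objects in the new region learn that they should install it.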

Optimization: Lazy Query Propagation 

The procedure we presented above uses an eager query propagation approach for han- 
dling objects changing their current grid cells. It requires each object (focal or non-focal) 
to contact the server and transfer information whenever it changes its current grid cell. 
The only reason for a non-focal object to communicate with the server is to immediately 
obtain the list of new queries that it needs to install in response to changing its current 
grid cell. We refer to this scheme as Eager Query Propagation (EQP).

To reduce the amount of communication between moving objects and the server,
MobiEyes also provides a lazy query propagation approach, which eliminates the need for
non-focal objects to contact the server to obtain the list of new MQs.
Instead of obtaining the new queries from the server and installing them immediately
upon a grid cell change, the moving object can wait until the server broadcasts
the next velocity vector changes regarding the focal objects of these queries to the area
in which the object is located. In this case the velocity vector change notifications are
expanded to include the spatial region and the filter of the queries, so that the object
can install the new queries upon receiving the broadcast message on the velocity vector
changes of the focal objects of the moving queries. With lazy propagation, moving
objects that change their current grid cells remain unaware of the new set of queries
nearby until the focal objects of these queries change their velocity vectors or move out
of their current grid cells. Lazy propagation works well when the grid cell size
α is large and the focal objects of queries change their velocity vectors frequently.
Lazy query propagation may not prevail over eager query propagation when: (1) the
focal objects do not change their velocity vectors significantly, (2) the grid cell
size α is too small, and (3) the non-focal moving objects change their current grid cells
at a much faster rate than the focal objects. In such situations, non-focal objects may
end up missing some moving queries. We evaluate the Lazy Query Propagation (LQP)
approach and study its performance advantages as well as its impact on the query result
accuracy in Section 5.

3.6 Moving Object Query Processing Logic

A moving object periodically processes all queries registered in its LQT table. For each 
query, it predicts the position of the focal object of the query using the velocity, time, 
and position information available in the LQT entry of the query. Then it compares 
its current position and the predicted position of the query’s focal object to determine 
whether it is covered by the query’s spatial region or not. When the result is different
from the last result computed in the previous time step, the object notifies the server of 
this change, which in turn differentially updates the query result. 
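
The periodic evaluation loop can be sketched as below. This is a minimal sketch, assuming a circular spatial region centered on the focal object and linear extrapolation of the focal object's position; the LQT entry fields (`pos`, `vel`, `ts`, `radius`, `last_result`) and the function name are hypothetical.

```python
import math


def evaluate_lqt(lqt, my_pos, now):
    """Return [(qid, inside)] for queries whose containment result changed.

    Each LQT entry holds the focal object's last relayed position, velocity,
    and timestamp, plus the query radius and the previous evaluation result.
    """
    changes = []
    for qid, entry in lqt.items():
        dt = now - entry["ts"]
        # Predict where the focal object is now from its last relayed state.
        fx = entry["pos"][0] + entry["vel"][0] * dt
        fy = entry["pos"][1] + entry["vel"][1] * dt
        inside = math.hypot(my_pos[0] - fx, my_pos[1] - fy) <= entry["radius"]
        if inside != entry.get("last_result", False):
            entry["last_result"] = inside
            changes.append((qid, inside))  # only changes are sent to the server
    return changes
```

Because only result changes are reported, an object that stays inside (or outside) a query's region across many evaluation periods sends nothing to the server.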



4 Optimizations 

In this section we present two additional optimization techniques aiming at control- 
ling the amount of local processing at the mobile object side and further reducing the 
communication between the mobile objects and the server. 

4.1 Query Grouping 

It is widely recognized that a mobile user can pose many different queries and a query 
can be posed multiple times by different users. Thus, in a mobile system many moving 
queries may share the same focal object. Effective optimizations can be applied to 
handle multiple queries bound to the same moving object. These optimizations help
decrease both the computational load on the moving objects and the messaging cost
of the MobiEyes approach, in situations where the query distribution over focal objects 
is skewed. 

We define a set of moving queries as groupable MQs if they are bound to the
same focal object. In addition to being associated with the same focal object, some
groupable queries may have the same monitoring region. We refer to MQs that have
the same monitoring region as MQs with matching monitoring regions, whereas we refer
to MQs that have different monitoring regions as MQs with non-matching monitoring
regions. Based on these different patterns, different grouping techniques can be applied
to groupable MQs.

Grouping MQs with Matching Monitoring Regions 

MQs with matching monitoring regions can be grouped most efficiently to reduce the
communication and processing costs of such queries. In MobiEyes, we introduce the
concept of a query bitmap, which is a bitmap containing one bit for each query in a query
group; each bit can be set to 1 or 0, indicating whether the corresponding query should
include the moving object in its result or not. We illustrate this with an example. Consider
three MQs: $q_1 = (qid_1, oid_i, r_1, filter_1)$, $q_2 = (qid_2, oid_i, r_2, filter_2)$, and
$q_3 = (qid_3, oid_i, r_3, filter_3)$ that share the same monitoring region. Note that these queries
share their focal object, which is the object with identifier $oid_i$. Instead of shipping three
separate queries to the mobile objects, the server can combine these queries into a single
query as follows: $q_3 = (qid_3, oid_i, (r_1, r_2, r_3), (filter_1, filter_2, filter_3))$. With this
grouping at hand, when a moving object is processing a set of groupable MQs with
matching monitoring regions, it needs to consider queries with smaller radii only if
it finds out that its current position is inside the spatial region of a query with a larger
radius. When a moving object reports to the server whether it is included in the results
of the queries that form the grouped query or not, it attaches the query bitmap to the
notification. For each query, its query bitmap bit is set to 1 only if the moving object
stays inside the spatial region of the query and the filter of the query is satisfied. With
the query bitmap, the server is able to infer information about individual query results
with respect to the reporting object.
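
A query bitmap for a grouped query can be computed as sketched below. This is an illustrative sketch only: it assumes concentric circular regions around the shared focal object, filters modeled as predicates over the object's attributes, and a bit order that follows the order of the radii in the grouped query; the function name is hypothetical.

```python
def query_bitmap(dist_to_focal, radii, filters, attrs):
    """Compute one bit per grouped query for the evaluating object.

    Queries are visited from the largest radius to the smallest, so that
    smaller circles are examined only while the object is inside a larger one.
    """
    bits = [0] * len(radii)
    by_radius_desc = sorted(range(len(radii)), key=lambda i: radii[i], reverse=True)
    for i in by_radius_desc:
        if dist_to_focal > radii[i]:
            break  # outside this circle implies outside every smaller one
        if filters[i](attrs):
            bits[i] = 1  # inside the region and the filter is satisfied
    return bits
```

For radii (3, 1, 2) and an object at distance 1.5 from the focal object, the circle of radius 1 is never the reason for extra work beyond a single comparison, and an object at distance 4 stops after the first comparison.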

Grouping MQs with Non-matching Monitoring Regions 

A clean way to handle MQs with non-matching monitoring regions is to perform grouping
on the moving object side only. We illustrate this with an example. Consider an object $o_j$
that has two groupable MQs with non-matching monitoring regions, $q_4$ and $q_5$, installed
in its LQT table. Since there is no global server side grouping performed for these
queries, $o_j$ can save some processing only by combining these two queries inside its
LQT table. This way, it needs to consider the query with the smaller radius only if
it finds out that its current position is inside the spatial region of the one with the larger
radius.



4.2 Safe Period Optimization 



In MobiEyes, each moving object that resides in the monitoring region of a query needs 
to evaluate the queries registered in its local query table LQT periodically. For each 
query the candidate object needs to determine if it should be included in the answer 
of the query. The interval for such periodic evaluation can be set either by the server 
or by the mobile object itself. A safe period optimization can be applied to reduce the
computation load on the mobile object side. It computes a safe period for each
object in the monitoring region of a query, provided that an upper bound (maxVel) exists on the
maximum velocities of the moving objects.

The safe periods for queries are calculated by an object o as follows: For each
query q in its LQT table, the object o calculates a worst case lower bound on the
amount of time that has to pass before it can be located inside the area covered by the query q’s
spatial region. We call this time the safe period (sp) of the object o with respect to the
query q, denoted as sp(o, q). The safe period can be formally defined as follows. Let
$o_i$ be the object that has the query $q_k$ with focal object $o_j$ in its LQT table, let
$dist(o_i, o_j)$ denote the distance between these two objects, and let $q_k.region$ denote the
circle shaped region with radius r. In the worst case, the two objects approach each
other with their maximum velocities in the direction of the shortest path between them.
Then $sp(o_i, q_k) = \frac{dist(o_i, o_j) - r}{2 \cdot maxVel}$.

Table 1. Simulation Parameters

Parameter  Description                                               Value range                          Default value
ts         Time step                                                 30 seconds
α          Grid cell side length                                     0.5-16 miles                         5 miles
no         Number of objects                                         1,000-10,000                         10,000
nmq        Number of moving queries                                  100-1,000                            1,000
nmo        Number of objects changing velocity vector per time step  100-1,000                            1,000
area       Area of consideration                                     100,000 square miles
alen       Base station side length                                  5-80 miles                           10 miles
qradius    Query radius                                              {3, 2, 1, 4, 5} miles
qselect    Query selectivity                                         0.75
mospeed    Max. object speed                                         {100, 50, 150, 200, 250} miles/hour

Once the safe period sp of a moving object is calculated for a query, it is safe for
the object to start the periodic evaluation of this query only after the safe period has passed.
In order to integrate this optimization with the base algorithm, we include a processing
time (ptm) field in the LQT table, which is initialized to 0. When a query in LQT
is to be processed, ptm is checked first. If ptm is ahead of the current time ctm,
the query is skipped. Otherwise, it is processed as usual. After processing of the query,
if the object is found to be outside the area covered by the query’s spatial region, the
safe period sp is calculated for the query and the processing time ptm of the query is set to
the current time plus the safe period, ctm + sp. When the query evaluation period is short,
the object speeds are low, or the cell size α of the grid is large, this optimization can be
very effective.
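
The safe period computation and the ptm check of Section 4.2 can be sketched as follows. This is a sketch under stated assumptions: the worst-case closing speed is taken as twice maxVel (both objects heading toward each other at the maximum velocity), the safe period is clamped at zero when the object is already inside the region, time and distance units are assumed to be consistent with the velocity unit, and the entry fields and function names are hypothetical.

```python
def safe_period(dist, radius, max_vel):
    """Worst-case lower bound on the time before the object can enter the region."""
    return max(0.0, (dist - radius) / (2.0 * max_vel))


def process_query(entry, dist, ctm, max_vel):
    """Evaluate one LQT entry at current time ctm, honoring its ptm field.

    Returns None if the query is skipped, otherwise whether the object is
    inside the query's spatial region.
    """
    if entry["ptm"] > ctm:
        return None  # still within the safe period: skip evaluation
    inside = dist <= entry["radius"]
    if not inside:
        # Outside the region: postpone the next evaluation by the safe period.
        entry["ptm"] = ctm + safe_period(dist, entry["radius"], max_vel)
    return inside
```

For example, an object 25 miles from the focal object of a query with a 5 mile radius, with maxVel of 100 miles/hour, obtains a safe period of 0.1 hours, which spans several 30 second evaluation steps.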

5 Experiments 

In this section we describe three sets of simulation based experiments. The first set of 
experiments illustrates the scalability of the MobiEyes approach with respect to server 
load. The second set of experiments focuses on the messaging cost and studies the effects 
of several parameters on the messaging cost. The third set of experiments investigates 
the amount of computation a moving object has to perform, by measuring on average 
the number of queries a moving object needs to process during each local evaluation 
period. 

5.1 Simulation Setup 

We list the set of parameters used in the simulation in Table 1. In all of the experiments,
the parameters take their default values if not specified otherwise. The area of interest
is a square shaped region of 100,000 square miles. The number of objects we consider
ranges from 1,000 to 10,000, while the number of queries ranges from 100 to 1,000.

We randomly select focal objects of the queries using a uniform distribution. The 
spatial region of a query is taken as a circular region whose radius is a random variable 
following a normal distribution. For a given query, the mean of the query radius is 
selected from the list {3, 2, 1, 4, 5} (miles) following a Zipf distribution with parameter
0.8 and the std. deviation of the query radius is taken as 1/5th of its mean. The selectivity
of the queries is taken as 0.75. 
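
One way to generate such a query workload is sketched below. This is an assumption-laden sketch and not the authors' simulator: the Zipf distribution is interpreted as weights proportional to 1/rank^0.8 over the list in the order given, negative radius samples are clamped to zero, and all function names are hypothetical.

```python
import random

RADIUS_MEANS = [3, 2, 1, 4, 5]  # miles, in the order listed in the text


def zipf_weights(n, s=0.8):
    """Normalized weights proportional to 1 / rank**s for ranks 1..n."""
    raw = [1.0 / (rank ** s) for rank in range(1, n + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def sample_query(rng, object_ids):
    """Pick a focal object uniformly and draw a query radius."""
    focal = rng.choice(object_ids)
    weights = zipf_weights(len(RADIUS_MEANS))
    mean = rng.choices(RADIUS_MEANS, weights=weights)[0]
    # Standard deviation is one fifth of the mean; clamp is an added safeguard.
    radius = max(0.0, rng.gauss(mean, mean / 5.0))
    return focal, radius
```

Passing an explicit `random.Random(seed)` instance as `rng` keeps a simulation run reproducible.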

We model the movement of the objects as follows. We assign a maximum velocity to
each object from the list {100, 50, 150, 200, 250} (miles/hour), using a Zipf distribution
with parameter 0.8. The simulation has a time step parameter of 30 seconds. In every
time step we pick a number of objects at random and set their normalized velocity vectors
to a random direction, while setting their velocity to a random value between zero and
their maximum velocity. All other objects are assumed to continue their motion with
their unchanged velocity vectors. The number of objects that change velocity vectors
during each time step is a parameter whose value ranges from 100 to 1,000.

Fig. 1. Impact of distributed query processing on server load
Fig. 2. Error associated with lazy query propagation
Fig. 3. Effect of α on server load

5.2 Server Load 

In this section we compare our MobiEyes distributed query processing approach with 
two popular central query processing approaches, with regard to server load. The two 
centralized approaches we consider are indexing objects and indexing queries. Both are 
based on a central server on which the object locations are explicitly manipulated by the 
server logic as they arrive, for the purpose of answering queries. We can either assume that 
the objects are reporting their positions periodically or we can assume that periodically 
object locations are extracted from velocity vector and time information associated with 
moving objects, on the server side. We first describe these two approaches and later
compare them with the distributed MobiEyes approach with regard to server
load.

Indexing Objects. The first centralized approach to processing spatial continuous 
queries on moving objects is by indexing objects. In this approach a spatial index is 
built over object locations. We use an R*-tree [3] for this purpose. As new object
positions are received, the spatial index (the R*-tree) on object locations is updated with the
new information. Periodically all queries are evaluated against the object index and the 
new results of the queries are determined. This is a straightforward approach and it is 
costly due to the frequent updates required on the spatial index over object locations. 

Indexing Queries. The second centralized approach to processing spatial continuous
queries on moving objects is by indexing queries. In this approach a spatial index, again
an R*-tree, is built over moving queries. As the new positions of the focal objects
of the queries are received, the spatial index is updated. This approach has the advantage 
of being able to perform differential evaluation of query results. When a new object 
position is received, it is run through the query index to determine to which queries 
this object actually contributes. Then the object is added to the results of these queries, 
and is removed from the results of other queries that have included it as a target object 
before. We have implemented both the object index and the query index approaches for
centralized processing of MQs. As a measure of server load, we took the time spent by
the simulation for executing the server side logic per time step. Figure 1 and Figure 3
depict the results obtained. Note that the y-axes, which represent the server load, are in
log-scale. The x-axis represents the number of queries considered in Figure 1, and the
different settings of the α parameter in Figure 3.

It is observed from Figure 1 that the MobiEyes approach provides up to two orders 
of magnitude improvement on server load. In contrast, the object index approach has an 
almost constant cost, which slightly increases with the number of queries. This is due 
to the fact that the main cost of this approach is to update the spatial index when object 
positions change. Although the query index approach clearly outperforms the object
index approach for a small number of queries, its performance worsens as the number of
queries increases. This is due to the fact that the main cost of this approach is to update the
spatial index when focal objects of the queries change their positions. Our distributed
approach also shows an increase in server load as the number of queries increases, but it
preserves the relative gain against the query index.

Figure 1 also shows the improvement in server load using lazy query propagation 
(LQP) compared to the default eager query propagation (EQP). However as described 
in Section 3.5, lazy query propagation may have some inaccuracy associated with it. 
Figure 2 studies this inaccuracy and the parameters that influence it. For a given query, 
we define the error in the query result at a given time, as the number of missing object 
identifiers in the result (compared to the correct result) divided by the size of the correct 
query result. Figure 2 plots the average error in the query results when lazy query
propagation is used, as a function of the number of objects changing velocity vectors per
time step, for different values of α. Frequent velocity vector changes are expected to
increase the accuracy of the query results. This is observed from Figure 2, as it shows
that the error in query results decreases with an increasing number of objects changing
velocity vectors per time step. Frequent grid cell crossings are expected to decrease the
accuracy of the query results. This is observed from Figure 2, as it shows that the error
in query results increases with decreasing α.
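
The error measure defined above can be written down directly. The sketch below assumes query results are sets of object identifiers; the function name and the convention of returning zero for an empty correct result are assumptions added for illustration.

```python
def result_error(correct_ids, reported_ids):
    """Fraction of the correct result that is missing from the reported result."""
    correct = set(correct_ids)
    if not correct:
        return 0.0  # assumed convention: nothing can be missing
    missing = correct - set(reported_ids)
    return len(missing) / len(correct)
```

Note that this measure counts only missing identifiers, which matches the failure mode of lazy propagation: objects that have not yet learned about a query cannot report themselves.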

Figure 3 shows that the performance of the MobiEyes approach in terms of server
load worsens for too small and too large values of the α parameter. However, it still
outperforms the object index and query index approaches. For small values of α, the
frequent grid cell changes increase the server load. On the other hand, for large values
of α, the large monitoring areas increase the server’s job of mediating between focal
objects and the objects that are lying in the monitoring regions of the focal objects’
queries. Several factors may affect the selection of an appropriate α value. We further
investigate selecting a good value for α in the next section.

5.3 Messaging Cost 

In this section we discuss the effects of several parameters on the messaging cost of our 
solution. In most of the experiments presented in this section, we report the total number 
of messages sent on the wireless medium per second. The number of messages reported 
includes two types of messages. The first type of messages are the ones that are sent 
from a moving object to the server (uplink messages), and the second type of messages 
are the ones broadcast by a base station to a certain area or sent to a moving object as a
one-to-one message from the server (downlink messages). We evaluate and compare our
results using two different scenarios. In the first scenario each object reports its position
directly to the server at each time step, if its position has changed. We call this the
naive approach. In the second scenario each object reports its velocity vector at each
time step, if the velocity vector has changed (significantly) since the last time. We call
this the central optimal approach. As the name suggests, this is the minimum amount
of information required for a centralized approach to evaluate queries unless there is an
assumption about object trajectories. Both scenarios assume a central processing
scheme.

Fig. 4. Effect of α on messaging cost
Fig. 5. Effect of # of objects on messaging cost
Fig. 6. Effect of # of objs. on uplink messaging cost

One crucial concern is defining an optimal value for the parameter α, which is the
length of a grid cell. The graph in Figure 4 plots the number of messages per second
as a function of α for different numbers of queries. As seen from the figure, both too
small and too large values of α have a negative effect on the messaging cost. For smaller
values of α this is because objects change their current grid cell quite frequently. For
larger values of α this is mainly because the monitoring regions of the queries become
larger. As a result, more broadcasts are needed to notify objects in a larger area of the
changes related to the focal objects of the queries they are to be considered against.
Figure 4 shows that values in the range [4,6] are ideal for α with respect to the number of
queries ranging from 100 to 1000. The optimal value of the α parameter can be derived
analytically using a simple model. In this paper we omit the analytical model due to space
restrictions.

Figure 5 studies the effect of the number of objects on the messaging cost. It plots
the number of messages per second as a function of the number of objects for different
numbers of queries. While the number of objects is altered, the ratio of the number of
objects changing their velocity vectors per time step to the total number of objects is
kept constant and equal to its default value as obtained from Table 1. It is observed that,
when the number of queries is large and the number of objects is small, all approaches
come close to one another. However, the naive approach has a high cost when the ratio of
the number of objects to the number of queries is high. In the latter case, the central optimal
approach provides lower messaging cost when compared to MobiEyes with EQP, but
the gap between the two stays constant as the number of objects is increased. On the other
hand, MobiEyes with LQP scales better than all other approaches with an increasing number
of objects and shows improvement over the central optimal approach for smaller numbers of
queries. Figure 6 shows the uplink component of the messaging cost. The y-axis is plotted
in logarithmic scale for convenience of the comparison. Figure 6 clearly shows that
MobiEyes with LQP significantly cuts down the uplink messaging requirement, which
is crucial for asymmetric communication environments where uplink communication
bandwidth is considerably lower than downlink communication bandwidth.

Fig. 7. Effect of number of objects changing velocity vector per time step on messaging cost
Fig. 8. Effect of base station coverage area on messaging cost
Fig. 9. Effect of # of queries on per object power consumption due to communication

Figure 7 studies the effect of number of objects changing velocity vector per time 
step on the messaging cost. It plots the number of messages per second as a function of 
the number of objects changing velocity vector per time step for different numbers of 
queries. An important observation from Figure 7 is that the messaging cost of MobiEyes 
with EQP scales well when compared to the central optimal approach as the gap between 
the two tends to decrease as the number of objects changing velocity vector per time 
step increases. Again MobiEyes with LQP scales better than all other approaches and 
shows improvement over central optimal approach for smaller number of queries. 

Figure 8 studies the effect of base station coverage area on the messaging cost. It 
plots the number of messages per second as a function of the base station coverage area 
for different numbers of queries. It is observed from Figure 8 that increasing the base 
station coverage decreases the messaging cost up to some point after which the effect 
disappears. The reason for this is that, after the coverage areas of the base stations reach
a certain size, the monitoring regions associated with queries always lie in only one
base station’s coverage area. Although increasing base station size decreases the total
number of messages sent on the wireless medium, it will increase the average number
of messages received by a moving object due to the size difference between monitoring
regions and base station coverage areas. In a hypothetical case where the universe of
discourse is covered by a single base station, any server broadcast will be received by
any moving object. In such environments, indexing on the air [7] can be used as an
effective mechanism to deal with this problem. In this paper we do not consider such 
extreme scenarios. 

Per Object Power Consumption Due to Communication 

So far we have considered the scalability of MobiEyes in terms of the total number of
messages exchanged in the system. However, one crucial measure is the per object power
consumption due to communication. We measure the average communication-related
power consumption using a simple radio model where the transmission path consists
of transmitter electronics and a transmit amplifier, while the receiver path consists of
receiver electronics. Considering a GSM/GPRS device, we take the power consumption of
transmitter and receiver electronics as 150mW and 120mW respectively, and we assume
a 300mW transmit amplifier with 30% efficiency [8]. We consider 14kbps uplink and
28kbps downlink bandwidth (typical for current GPRS technology). Note that sending
data is more power consuming than receiving data (see footnote 2).

Fig. 10. Effect of α on the average number of queries evaluated per step on a moving object
Fig. 11. Effect of the total # of queries on the avg. # of queries evaluated per step on a moving object
Fig. 12. Effect of the query radius on the average number of queries evaluated per step on a moving object

We simulated the MobiEyes approach using message sizes instead of message counts 
for messages exchanged and compared its power consumption due to communication 
with the naive and central optimal approaches. The graph in Figure 9 plots the per object 
power consumption due to communication as a function of the number of queries. Since
the naive approach requires every object to send its new position to the server, its per
object power consumption is the worst. In MobiEyes, however, a non-focal object does
not send its position or velocity vector to the server, but it receives query updates from
the server. Although the cost of receiving data in terms of consumed energy is lower
than that of transmitting, given a fixed number of objects, for larger numbers of queries the
central optimal approach outperforms MobiEyes in terms of power consumption due to
communication. An important factor that increases the per object power consumption 
in MobiEyes is the fact that an object also receives updates regarding queries that are 
irrelevant mainly due to the difference between the size of a broadcast area and the 
monitoring region of a query. 
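
The radio model used in this subsection can be turned into per-bit energy costs, as sketched below. The sketch rests on two assumptions that are not spelled out in the text: the 300mW amplifier figure is read as output power, so that the amplifier draws 300mW / 0.30 = 1W, and 1kbps is taken as 1,000 bits per second. All constant and function names are hypothetical.

```python
TX_ELECTRONICS_W = 0.150   # transmitter electronics
RX_ELECTRONICS_W = 0.120   # receiver electronics
AMP_OUTPUT_W = 0.300       # transmit amplifier, assumed to be output power
AMP_EFFICIENCY = 0.30
UPLINK_BPS = 14_000
DOWNLINK_BPS = 28_000


def tx_energy_per_bit():
    """Joules spent by a device to transmit one bit on the uplink."""
    amp_draw_w = AMP_OUTPUT_W / AMP_EFFICIENCY
    return (TX_ELECTRONICS_W + amp_draw_w) / UPLINK_BPS


def rx_energy_per_bit():
    """Joules spent by a device to receive one bit on the downlink."""
    return RX_ELECTRONICS_W / DOWNLINK_BPS


def message_energy(bits_sent, bits_received):
    """Total communication energy in joules for one object."""
    return bits_sent * tx_energy_per_bit() + bits_received * rx_energy_per_bit()
```

Under these assumptions the model gives about 82 microjoules per transmitted bit and about 4.3 microjoules per received bit, in line with the approximate figures quoted in footnote 2, so a received bit costs roughly one twentieth of a transmitted one.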



5.4 Computation on the Moving Object Side 

In this section we study the amount of computation placed on the moving object side 
by the MobiEyes approach for processing MQs. One measure of this is the number of 
queries a moving object has to evaluate at each time step, which is the size of the LQT 
(Recall Section 3.2). 

2 In this setting, transmitting costs approximately 80 μjoules/bit and receiving costs approximately 5 μjoules/bit.

Figure 10 and Figure 11 study the effect of α and the effect of the total number of
queries on the average number of queries a moving object has to evaluate at each time
step (average LQT table size). The graph in Figure 10 plots the average LQT table
size as a function of α for different numbers of queries. The graph in Figure 11 plots the
same measure, but this time as a function of the number of queries for different values of
α. The first observation from these two figures is that the size of the LQT table does
not exceed 10 for the simulation setup. The second observation is that the average size
of the LQT table increases exponentially with α, whereas it increases linearly with the
number of queries.

Figure 12 studies the effect of the query radius on the average number of queries a moving
object has to evaluate at each time step. The x-axis of the graph in Figure 12 represents the
radius factor, whose value is used to multiply the original radius value of the queries. The
y-axis represents the average LQT table size. It is observed from the figure that larger
query radius values increase the LQT table size. However, this effect is only visible for
radius values whose difference from each other is larger than α. This is a direct result of
the definition of the monitoring region from Section 2.

Figure 13 studies the effect of the safe period optimization on the average query processing
load of a moving object. The x-axis of the graph in Figure 13 represents the α parameter, and
the y-axis represents the average query processing load of a moving object. As a measure of
query processing load, we took the average time spent by a moving object for processing its
LQT table in the simulation. Figure 13 shows that for large values of α, the safe period
optimization is very effective. This is because, as α gets larger, monitoring regions get larger,
which increases the average distance between the focal object of a query and the objects in
its monitoring region. This results in non-zero safe periods and decreases the cost of
processing the LQT table. On the other hand, for very small values of α, like α = 1 in
Figure 13, the safe period optimization incurs a small overhead. This is because the safe
period is almost always less than the query evaluation period for very small α values
and as a result the extra processing done for safe period calculations does not pay off.




Fig. 13. Effect of the safe period optimization on the average query processing load of a moving object



6 Related Work 

Evaluation of static spatial queries on moving objects at a centralized location is a
well-studied topic. In [14], Velocity Constrained Indexing and Query Indexing are proposed
for efficient evaluation of this kind of query at a central location. Several other indexing
structures and algorithms for handling moving object positions are suggested in the
literature [17,15,9,2,4,18]. There are two main points where our work departs from this
line of work.

First, most of the work done in this respect has focused on efficient indexing structures
and has ignored the underlying mobile communication system and the mobile objects.
To our knowledge, only the SQM system introduced in [5] has proposed a distributed
solution for evaluation of static spatial queries on moving objects that makes use of the
computational capabilities present at the mobile objects.

Second, the concept of dynamic queries presented in [10] is to some extent similar
to the concept of moving queries in MobiEyes. But there are two subtle differences.
First, a dynamic query is defined as a temporally ordered set of snapshot queries in [10].
This is a low-level definition. In contrast, our definition of moving queries is at the
end-user level, which includes the notion of a focal object. Second, the work done in [10]
indexes the trajectories of the moving objects and describes how to efficiently evaluate 
dynamic queries that represent predictable or non-predictable movement of an observer. 
They also describe how new trajectories can be added when a dynamic query is actively 
running. Their assumptions are in line with their motivating scenario, which is to support 
rendering of objects in virtual tour-like applications. The MobiEyes solution discussed in 
this paper focuses on real-time evaluation of moving queries in real-world settings, where 
the trajectories of the moving objects are unpredictable and the queries are associated 
with moving objects inside the system. 

7 Conclusion 

We have described MobiEyes, a distributed scheme for processing moving queries on 
moving objects in a mobile setup. We demonstrated the effectiveness of our approach 
through a set of simulation based experiments. We showed that the distributed processing 
of MQs significantly decreases the server load and scales well in terms of messaging 
cost while placing only a small amount of processing burden on moving objects.



Abstract. Clustering has become an increasingly important task in modern application
domains such as marketing and purchasing assistance, multimedia, molecular biology
as well as many others. In most of these areas, the data are originally collected
at different sites. In order to extract information from these data, they are merged at 
a central site and then clustered. In this paper, we propose a different approach. We 
cluster the data locally and extract suitable representatives from these clusters. These 
representatives are sent to a global server site where we restore the complete cluster- 
ing based on the local representatives. This approach is very efficient, because the lo- 
cal clustering can be carried out quickly and independently from each other. 
Furthermore, we have low transmission cost, as the number of transmitted represent- 
atives is much smaller than the cardinality of the complete data set. Based on this 
small number of representatives, the global clustering can be done very efficiently. 
For both the local and the global clustering, we use a density based clustering algo- 
rithm. The combination of both the local and the global clustering forms our new 
DBDC (Density Based Distributed Clustering) algorithm. Furthermore, we discuss 
the complex problem of finding a suitable quality measure for evaluating distributed 
clusterings. We introduce two quality criteria which are compared to each other and 
which allow us to evaluate the quality of our DBDC algorithm. In our experimental 
evaluation, we will show that we do not have to sacrifice clustering quality in order 
to gain an efficiency advantage when using our distributed clustering approach. 



1 Introduction 

Knowledge Discovery in Databases (KDD) tries to identify valid, novel, potentially useful, 
and ultimately understandable patterns in data. Traditional KDD applications require full 
access to the data which is going to be analyzed. All data has to be located at that site where 
it is scrutinized. Nowadays, large amounts of heterogeneous, complex data reside on differ- 
ent, independently working computers which are connected to each other via local or wide 
area networks (LANs or WANs). Examples comprise distributed mobile networks, sensor 
networks or supermarket chains where check-out scanners, located at different stores, gather 
data unremittingly. Furthermore, international companies such as DaimlerChrysler have 
some data which is located in Europe and some data in the US. Those companies have var- 
ious reasons why the data cannot be transmitted to a central site, e.g. limited bandwidth or 
security aspects. 

The transmission of huge amounts of data from one site to another central site is in some 
application areas almost impossible. In astronomy, for instance, there exist several highly 
sophisticated space telescopes spread all over the world. These telescopes gather data unceasingly.
Each of them is able to collect 1GB of data per hour [10] which can only, with
great difficulty, be transmitted to a central site to be analyzed centrally there. On the other 
hand, it is possible to analyze the data locally where it has been generated and stored. Aggregated
information of this locally analyzed data can then be sent to a central site where
the information of different local sites is combined and analyzed. The result of the central
analysis may be returned to the local sites, so that the local sites are able to put their data 
into a global context. 

The requirement to extract knowledge from distributed data, without a prior unification 
of the data, created the rather new research area of Distributed Knowledge Discovery in Da- 
tabases (DKDD). In this paper, we will present an approach where we first cluster the data 
locally. Then we extract aggregated information about the locally created clusters and send 
this information to a central site. The transmission costs are minimal as the representatives 
are only a fraction of the original data. On the central site we “reconstruct” a global cluster- 
ing based on the representatives and send the result back to the local sites. The local sites 
update their clustering based on the global model, e.g. merge two local clusters to one or 
assign local noise to global clusters. 

The paper is organized as follows. In Section 2, we briefly review related work in the
area of clustering. In Section 3, we present a general overview of our distributed clustering
algorithm, before we go into more detail in the following sections. In Section 4, we describe
our local density based clustering algorithm. In Section 5, we discuss how we can represent
a local clustering by relatively little information. In Section 6, we describe how we can restore
a global clustering based on the information transmitted from the local sites. Section 7
covers the problem of how the local sites update their clustering based on the global clustering
information. In Section 8, we introduce two quality criteria which allow us to evaluate our 
new efficient DBDC (Density Based Distributed Clustering) approach. In Section 9, we 
present the experimental evaluation of the DBDC approach and show that its use does not 
suffer from a deterioration of quality. We conclude the paper in Section 10. 



2 Related Work 

In this section, we first review and classify the most common clustering algorithms. In
Section 2.2, we briefly look at parallel clustering, which has some affinity to distributed
clustering.

2.1 Clustering 

Given a set of objects with a distance function on them (i.e. a feature database), an interesting
data mining question is whether these objects naturally form groups (called clusters)
and what these groups look like. Data mining algorithms that try to answer this question are
called clustering algorithms. In this section, we classify well-known clustering algorithms
according to different categorization schemes.

Clustering algorithms can be classified along different, independent dimensions. One 
well-known dimension categorizes clustering methods according to the result they produce. 
Here, we can distinguish between hierarchical and partitioning clustering algorithms [13, 
15]. Partitioning algorithms construct a flat (single level) partition of a database D of n objects
into a set of k clusters such that the objects in a cluster are more similar to each other
than to objects in different clusters. Hierarchical algorithms decompose the database into
several levels of nested partitionings (clusterings), represented for example by a dendrogram,
i.e. a tree that iteratively splits D into smaller subsets until each subset consists of only
one object. In such a hierarchy, each node of the tree represents a cluster of D.

Fig. 1. Classification scheme for clustering algorithms

Another dimension according to which we can classify clustering algorithms is from an 
algorithmic point of view. Here we can distinguish between optimization based or distance 
based algorithms and density based algorithms. Distance based methods use the distances 
between the objects directly in order to optimize a global cluster criterion. In contrast, den- 
sity based algorithms apply a local cluster criterion. Clusters are regarded as regions in the 
data space in which the objects are dense, and which are separated by regions of low object 
density (noise). 

An overview of this classification scheme together with a number of important clustering
algorithms is given in Figure 1. As we do not have the space to cover them here, we refer
the interested reader to [15], where an excellent overview and further references can be found.

2.2 Parallel Clustering and Distributed Clustering 

Distributed Data Mining (DDM) is a dynamically growing area within the broader field 
of KDD. Generally, many algorithms for distributed data mining are based on algorithms 
which were originally developed for parallel data mining. In [16] some state-of-the-art research
results related to DDM are summarized.

Whereas there already exist algorithms for distributed and parallel classification and as- 
sociation rules [2, 12, 17, 18, 20, 22], there do not exist many algorithms for parallel and 
distributed clustering. 

In [9] the authors sketched a technique for parallelizing a family of center-based data 
clustering algorithms. They indicated that it can be more cost effective to cluster the data 
in-place using an exact distributed algorithm than to collect the data in one central location 
for clustering. In [14] the “collective hierarchical clustering algorithm” for vertically distributed
data sets was proposed which applies single link clustering. In contrast to this approach,
we concentrate in this paper on horizontally distributed data sets and apply a
partitioning clustering. In [19] the authors focus on the reduction of the communication cost
by using traditional hierarchical clustering algorithms for massive distributed data sets. 
They developed a technique for centroid-based hierarchical clustering for high dimensional, 
horizontally distributed data sets by merging clustering hierarchies generated locally. In 
contrast, this paper concentrates on density based partitioning clustering. 

In [21] a parallel version of DBSCAN [7] and in [5] a parallel version of k-means [11] 
were introduced. Both algorithms start with the complete data set residing on one central 
server and then distribute the data among the different clients. 

The algorithm presented in [5] distributes N objects onto P processors. Furthermore, k 
initial centroids are determined which are distributed onto the P processors. Each processor 
assigns each of its objects to one of the k centroids. Afterwards, the global centroids are up- 
dated (reduction operation). This process is carried out repeatedly until the centroids do not 
change any more. Furthermore, this approach suffers from the general shortcoming of 
k-means, where the number of clusters has to be defined by the user and is not determined 
automatically. 
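
The iteration described above can be sketched in a few lines of Python. This is only an illustrative simulation under our own assumptions: the P processors are modelled as P lists inside a single process, objects are tuples of numbers, and the function names `assign_partial` and `parallel_kmeans` are hypothetical and not taken from [5].

```python
import math

def assign_partial(points, centroids):
    """Work of one processor: assign its objects, return per-centroid sums and counts."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for p in points:
        j = min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    return sums, counts

def parallel_kmeans(partitions, centroids, max_iter=100):
    """partitions: one list of objects per (simulated) processor."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        partials = [assign_partial(part, centroids) for part in partitions]
        new_centroids = []
        for j, old in enumerate(centroids):
            # Reduction operation: combine the partial results of all processors.
            n = sum(counts[j] for _, counts in partials)
            if n == 0:
                new_centroids.append(old)  # assumption: an empty centroid stays put
                continue
            total = [sum(sums[j][d] for sums, _ in partials) for d in range(len(old))]
            new_centroids.append(tuple(t / n for t in total))
        if new_centroids == centroids:     # centroids do not change any more
            break
        centroids = new_centroids
    return centroids
```

Only the per-centroid sums and counts cross processor boundaries in each round, which is what makes the reduction step cheap compared to shipping the objects themselves.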

The authors in [21] tackled these problems and presented a parallel version of DBSCAN.
They used a 'shared nothing' architecture, where several processors were connected to
each other. The basic data structure was the dR*-tree, a modification of the R*-tree [3]. The
dR*-tree is a distributed index structure where the objects reside on various machines. By
using the information stored in the dR*-tree, each local site has access to the data residing
on different computers. Similar to parallel k-means, the different computers communicate
via message passing.

In this paper, we propose a different approach for distributed clustering assuming we 
cannot carry out a preprocessing step on the server site as the data is not centrally available. 
Furthermore, we abstain from an additional communication between the various client sites 
as we assume that they are independent from each other. 



3 Density Based Distributed Clustering 

Distributed Clustering assumes that the objects to be clustered reside on different sites. In- 
stead of transmitting all objects to a central site (also denoted as server) where we can apply 
standard clustering algorithms to analyze the data, the data are clustered independently on 
the different local sites (also denoted as clients). In a subsequent step, the central site tries 
to establish a global clustering based on the local models, i.e. the representatives. This is a 
very difficult step as there might exist dependencies between objects located on different 
sites which are not taken into consideration by the creation of the local models. In contrast 
to a central clustering of the complete dataset, the central clustering of the local models can 
be carried out much faster. 

Distributed Clustering is carried out on two different levels, i.e. the local level and the 
global level (cf. Figure 2). On the local level, all sites carry out a clustering independently 
from each other. After having completed the clustering, a local model is determined which 
should reflect an optimum trade-off between complexity and accuracy. Our proposed local 
models consist of a set of representatives for each locally found cluster. Each representative 
is a concrete object from the objects stored on the local site. Furthermore, we augment each 
representative with a suitable ε-range value. Thus, a representative is a good approximation
for all objects residing on the corresponding local site which are contained in the ε-range
around this representative.

Next the local model is transferred to a central site, where the local models are merged 
in order to form a global model. The global model is created by analyzing the local repre- 
sentatives. This analysis is similar to a new clustering of the representatives with suitable 
global clustering parameters. To each local representative a global cluster-identifier is as- 
signed. This resulting global clustering is sent to all local sites. 

If a local object belongs to the ε-neighborhood of a global representative, the
cluster-identifier from this representative is assigned to the local object. Thus, we can achieve
that each site has the same information as if its data were clustered on a global site, together
with the data of all the other sites.

To sum up, distributed clustering consists of four different steps (cf. Figure 2): 

• Local clustering 

• Determination of a local model 

• Determination of a global model, which is based on all local models 

• Updating of all local models 



4 Local Clustering 

As the data are created and located at local sites, we cluster them there. The remaining
question is which clustering algorithm we should apply. K-means [11] is one of the most
commonly used clustering algorithms, but it does not perform well on data with outliers or
with clusters of different sizes or non-globular shapes [8]. The single link agglomerative
clustering method is suitable for capturing clusters with non-globular shapes, but this approach
is very sensitive to noise and cannot handle clusters of varying density [8]. We used
the density-based clustering algorithm DBSCAN [7], because it yields the following advantages:

• DBSCAN is rather robust concerning outliers. 

• DBSCAN can be used for all kinds of metric data spaces and is not confined to vector 
spaces. 

• DBSCAN is a very efficient and effective clustering algorithm.

• There exists an efficient incremental version, which would allow incremental clusterings
on the local sites. Thus, we have to transmit a new local model to the central site only if
the local clustering changes “considerably” [6].

We slightly enhanced DBSCAN so that we can easily determine the local model after we
have finished the local clustering. All information which is comprised within the local model,
i.e. the representatives and their corresponding ε-ranges, is computed on-the-fly during
the DBSCAN run.

In the following, we describe DBSCAN in a level of detail which is indispensable for 
understanding the process of extracting suitable representatives (cf. Section 5). 

4.1 The Density-Based Partitioning Clustering-Algorithm DBSCAN 

The key idea of density-based clustering is that for each object of a cluster the neighborhood
of a given radius (Eps) has to contain at least a minimum number of objects (MinPts),
i.e. the cardinality of the neighborhood has to exceed some threshold. Density-based clusters
can also be significantly generalized to density-connected sets. Density-connected sets
are defined along the same lines as density-based clusters.

We will first give a short introduction to DBSCAN. For a detailed presentation of
DBSCAN see [7].

Definition 1 (directly density-reachable). An object p is directly density-reachable from
an object q wrt. Eps and MinPts in the set of objects D if

• $p \in N_{Eps}(q)$ ($N_{Eps}(q)$ is the subset of D contained in the Eps-neighborhood of q)

• $|N_{Eps}(q)| \geq MinPts$ (core-object condition)
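
As a small illustration of Definition 1, the two conditions can be checked directly. The sketch below assumes Euclidean distance, objects given as coordinate tuples, and a linear scan instead of an index; the function names are hypothetical.

```python
import math

def region_query(D, q, eps):
    """N_Eps(q): all objects of D within distance eps of q (q itself included)."""
    return [o for o in D if math.dist(o, q) <= eps]

def directly_density_reachable(D, p, q, eps, min_pts):
    """True if p is directly density-reachable from q wrt. eps and min_pts in D."""
    neighborhood = region_query(D, q, eps)
    # condition 1: p lies in N_Eps(q); condition 2: q is a core object
    return p in neighborhood and len(neighborhood) >= min_pts
```

For the four collinear objects (0,0), (1,0), (2,0), (3,0) with Eps = 1 and MinPts = 3, the object (0,0) is directly density-reachable from (1,0), but not vice versa, because the Eps-neighborhood of (0,0) contains only two objects. The relation is therefore not symmetric when a border object is involved.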

Definition 2 (density-reachable). An object p is density-reachable from an object q wrt.
Eps and MinPts in the set of objects D, denoted as $p >_D q$, if there is a chain of objects
$p_1, \ldots, p_n$, $p_1 = q$, $p_n = p$ such that $p_i \in D$ and $p_{i+1}$ is directly density-reachable from $p_i$ wrt.
Eps and MinPts.

Density-reachability is a canonical extension of direct density-reachability. This relation
is transitive, but it is not symmetric. Although not symmetric in general, it is obvious that
density-reachability is symmetric for objects o with $|N_{Eps}(o)| \geq MinPts$. Two “border objects”
of a cluster are possibly not density-reachable from each other because there are not
enough objects in their Eps-neighborhoods. However, there must be a third object in the
cluster from which both “border objects” are density-reachable. Therefore, we introduce the
notion of density-connectivity.

Definition 3 (density-connected). An object p is density-connected to an object q wrt. Eps
and MinPts in the set of objects D if there is an object $o \in D$ such that both p and q are
density-reachable from o wrt. Eps and MinPts in D.

Density-connectivity is a symmetric relation. A cluster is defined as a set of density- 
connected objects which is maximal wrt. density-reachability and the noise is the set of ob- 
jects not contained in any cluster. 

Definition 4 (cluster). Let D be a set of objects. A cluster C wrt. Eps and MinPts in D is a
non-empty subset of D satisfying the following conditions:

• Maximality: $\forall p, q \in D$: if $p \in C$ and $q >_D p$ wrt. Eps and MinPts, then also $q \in C$.

• Connectivity: $\forall p, q \in C$: p is density-connected to q wrt. Eps and MinPts in D.

Definition 5 (noise). Let $C_1, \ldots, C_k$ be the clusters wrt. Eps and MinPts in D. Then, we define
the noise as the set of objects in the database D not belonging to any cluster $C_i$, i.e.
$noise = \{p \in D \mid \forall i: p \notin C_i\}$.

We omit the term “wrt. Eps and MinPts” in the following whenever it is clear from the 
context. There are different kinds of objects in a clustering: core objects (satisfying condi- 
tion 2 of definition 1) or non-core objects otherwise. In the following, we will refer to this 
characteristic of an object as the core object property of the object. The non-core objects in 
turn are either border objects (no core object but density-reachable from another core ob- 
ject) or noise objects (no core object and not density-reachable from other objects). 

The algorithm DBSCAN was designed to efficiently discover the clusters and the noise
in a database according to the above definitions. The procedure for finding a cluster is based
on the fact that a cluster as defined is uniquely determined by any of its core objects: first,
given an arbitrary object p for which the core object condition holds, the set $\{o \mid o >_D p\}$ of
all objects o density-reachable from p in D forms a complete cluster C. Second, given a cluster
C and an arbitrary core object $p \in C$, C in turn equals the set $\{o \mid o >_D p\}$ (cf. Lemmata 1
and 2 in [7]).

To find a cluster, DBSCAN starts with an arbitrary core object p which is not yet clus- 
tered and retrieves all objects density-reachable from p. The retrieval of density-reachable 
objects is performed by successive region queries which are supported efficiently by spatial 
access methods such as R*-trees [3] for data from a vector space or M-trees [4] for data from 
a metric space. 
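
The cluster expansion just described can be sketched as follows. This is a simplified illustration and not the implementation used in the paper: region queries are answered by a linear scan rather than by an R*-tree or M-tree, distances are Euclidean, and the returned `is_core` flags are our own addition so that core objects can later be inspected.

```python
import math

NOISE = -1

def dbscan(D, eps, min_pts):
    """Return (labels, is_core) for the list of points D; NOISE marks noise objects."""
    def region_query(i):
        return [j for j in range(len(D)) if math.dist(D[i], D[j]) <= eps]

    labels = [None] * len(D)            # None = not yet visited
    is_core = [False] * len(D)
    cluster_id = 0
    for i in range(len(D)):
        if labels[i] is not None:
            continue
        neighbors = region_query(i)
        if len(neighbors) < min_pts:
            labels[i] = NOISE           # may later turn out to be a border object
            continue
        # i is an unclustered core object: collect everything density-reachable from it
        is_core[i] = True
        labels[i] = cluster_id
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border object, not expanded further
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(j)
            if len(j_neighbors) >= min_pts:
                is_core[j] = True
                seeds.extend(j_neighbors)
        cluster_id += 1
    return labels, is_core
```

With a linear scan every region query costs O(n), so the sketch runs in O(n²) time; the spatial access methods mentioned above exist precisely to reduce this cost.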



5 Determination of a Local Model 

After having clustered the data locally, we need a small number of representatives which 
describe the local clustering result accurately. We have to find an optimum trade-off be- 
tween the following two opposite requirements: 

• We would like to have a small number of representatives. 

• We would like to have an accurate description of a local cluster. 

As each core point computed during the DBSCAN run contains at least MinPts other objects
in its Eps-neighborhood, the core points might serve as good representatives. Unfortunately, their
number can become very high, especially in very dense areas of clusters. In the following,
we will introduce two different approaches for determining suitable representatives which
are both based on the concept of specific core points.

Definition 6 (specific core points). Let D be a set of objects and let $C \in 2^D$ be a cluster
wrt. Eps and MinPts. Furthermore, let $Cor_C \subseteq C$ be the set of core points belonging to this
cluster. Then $Scor_C \subseteq C$ is called a complete set of specific core points of C iff the following
conditions are true.

• $Scor_C \subseteq Cor_C$

• $\forall s_i, s_j \in Scor_C: s_i \neq s_j \Rightarrow s_i \notin N_{Eps}(s_j)$

• $\forall c \in Cor_C\ \exists s \in Scor_C: c \in N_{Eps}(s)$

There might exist several different sets $Scor_C$ which fulfil Definition 6. Each of these
sets $Scor_C$ usually consists of several specific core points which can be used to describe the
cluster C.

The small example in Figure 3a shows that if A is an element of the set of specific
core points Scor, object B cannot be included in Scor as it is located within the Eps-neighborhood
of A. C might be contained in Scor as it is not in the Eps-neighborhood of A.
On the other hand, if B is within Scor, A and C are not contained in Scor as they are both in
the Eps-neighborhood of B. The actual processing order of the objects during the DBSCAN
run determines a concrete set of specific core points. For instance, if the core point B is visited
first during the DBSCAN run, the core points A and C are not included in Scor.
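
One straightforward way to obtain a set satisfying Definition 6 is a greedy scan over the core points in visiting order. The sketch below is our own illustration of that idea, assuming Euclidean distance; the paper computes the set on-the-fly during the DBSCAN run, whereas the hypothetical function `specific_core_points` takes the already known core points of one cluster as input.

```python
import math

def specific_core_points(core_points, eps):
    """Greedily pick a complete set of specific core points of one cluster.

    core_points are scanned in the given order, which stands in for the
    order in which DBSCAN visits them.
    """
    scor = []
    for c in core_points:
        # Keep c only if it lies outside the Eps-neighborhood of every specific
        # core point chosen so far (condition 2). A skipped core point is covered
        # by some chosen one, so condition 3 holds as well.
        if all(math.dist(c, s) > eps for s in scor):
            scor.append(c)
    return scor
```

With hypothetical coordinates A = (0,0), B = (1,0), C = (2,0) and Eps = 1.5, the order A, B, C yields the set {A, C}, while visiting B first yields {B}, which mirrors the dependence on the processing order described for Figure 3a.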

In the following, we introduce two local models called $REP_{Scor}$ (cf. Section 5.1) and
$REP_{k\text{-}Means}$ (cf. Section 5.2), which both create a local model based on the complete set of
specific core points.

5.1 Local Model: $REP_{Scor}$

In this model, we represent each local cluster $C_i$ by a complete set of specific core points
$Scor_{C_i}$. If we assume that we have found n clusters $C_1, \ldots, C_n$ on a local site k, the local model
$LocalModel_k$ is formed by the union of the different sets $Scor_{C_i}$.

In the case of density-based clustering, very often several core points are in the
Eps-neighborhood of another core point. This is especially true if we have dense clusters
and a large Eps-value. In Figure 3a, for instance, the two core points A and B are within the
Eps-range of each other as dist(A, B) is smaller than Eps.

Assuming core point A is a specific core point, i.e. $A \in Scor$, then $B \notin Scor$ because of
condition 2 in Definition 6. In this case, object A should not only represent the objects in its
own neighborhood, but also the objects in the neighborhood of B, i.e. A should represent all
objects of $N_{Eps}(A) \cup N_{Eps}(B)$. In order for A to be a representative for the objects
$N_{Eps}(A) \cup N_{Eps}(B)$, we have to assign a new specific $\varepsilon_A$-range to A with $\varepsilon_A = Eps + dist(A,B)$
(cf. Figure 3a). Of course we have to assign such a specific ε-range to all specific core
points, which motivates the following definition:

Definition 7 (specific ε-ranges). Let $C \subseteq D$ be a cluster wrt. Eps and MinPts. Furthermore,
let $Scor \subseteq C$ be a complete set of specific core points. Then we assign to each $s \in Scor$ an
$\varepsilon_s$-range indicating the represented area of s:

$\varepsilon_s := Eps + \max\{dist(s, s_i) \mid s_i \in Cor \wedge s_i \in N_{Eps}(s)\}$.

This specific ε-range value is part of the local model and is evaluated on the server site
to develop an accurate global model. Furthermore, it is very important for the updating process
of the local objects. The specific ε-range value is integrated into the local model of site
k as follows:

$LocalModel_k := \bigcup_{i \in 1..n} \{(s, \varepsilon_s) \mid s \in Scor_{C_i}\}$.
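
The computation of the specific ε-ranges and of the resulting local model can be sketched as follows. The input layout (one pair of core points and specific core points per cluster) and the function name `rep_scor_model` are assumptions made for this illustration; distances are Euclidean.

```python
import math

def rep_scor_model(clusters, eps):
    """Build the local model of one site as a list of (s, eps_s) pairs.

    clusters: list of (core_points, scor) pairs, one pair per local cluster,
    where scor is a complete set of specific core points of that cluster.
    """
    model = []
    for core_points, scor in clusters:
        for s in scor:
            # distance to the farthest core point inside the Eps-neighborhood of s
            # (s itself contributes distance 0)
            reach = max(
                (math.dist(s, c) for c in core_points if math.dist(s, c) <= eps),
                default=0.0,
            )
            model.append((s, eps + reach))
    return model
```

For hypothetical core points A = (0,0), B = (1,0), C = (2,0) with Eps = 1.5 and Scor = {A, C}, both representatives receive the range 1.5 + 1 = 2.5, since B lies at distance 1 from each of them.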

5.2 Local Model: $REP_{k\text{-}Means}$

This approach is also based on the complete set of specific core points. In contrast to the
previous approach, the specific core points are not directly used to describe a cluster. Instead,
we use the number $|Scor_C|$ and the elements of $Scor_C$ as input parameters for a further
“clustering step” with an adapted version of k-means. For each cluster C found by
DBSCAN, k-means yields $|Scor_C|$ centroids within C. These centroids are used as representatives.
The small example in Figure 3b shows that if object A is a specific core point, and
we apply an additional clustering step by using k-means, we get a more appropriate representative A’.

Fig. 3. Local models a) $REP_{Scor}$: specific core points and specific ε-range b) $REP_{k\text{-}Means}$:
representatives by using k-means

K-means is a partitioning based clustering method which needs as input parameter the
number m of clusters which should be detected within a set M of objects. Furthermore, we
have to provide m starting points for this algorithm if we want to find m clusters. We use
k-means as follows:

• Each local cluster C which was found throughout the original DBSCAN run on the
local site forms a set M of objects which is again clustered with k-means.

• We ask k-means to find $|Scor_C|$ (sub)clusters within C, as all specific core points
together yield a suitable number of representatives. Each of the centroids found by
k-means within cluster C is then used as a new representative. Thus the number of
representatives for each cluster is the same as in the previous approach.

• As initial starting points for the clustering of C with k-means, we use the complete
set of specific core points $Scor_C$.

Again, let us assume that there are n clusters $C_1, \ldots, C_n$ on a local site k. Furthermore, let
$c_{i,1}, \ldots, c_{i,|Scor_{C_i}|}$ be the $|Scor_{C_i}|$ centroids found by the clustering of $C_i$ with k-means. Let
$O_{i,j} \subseteq C_i$ be the set of objects which are assigned to the centroid $c_{i,j}$. Then we assign to each
centroid $c_{i,j}$ an $\varepsilon_{c_{i,j}}$-range, indicating the area represented by $c_{i,j}$, as follows:

$\varepsilon_{c_{i,j}} := \max\{dist(o, c_{i,j}) \mid o \in O_{i,j}\}$.

Finally, the local model, describing the n clusters on site k, can be generated analogously
to the previous section as follows:

$LocalModel_k := \bigcup_{i \in 1..n} \bigcup_{j \in 1..|Scor_{C_i}|} \{(c_{i,j}, \varepsilon_{c_{i,j}})\}$.
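
A possible sketch of this second local model for a single cluster is given below. It uses plain Lloyd-style k-means seeded with the specific core points; the "adapted version of k-means" used in the paper is not specified in detail here, so this is an assumption, as are the function names and the decision to drop centroids that end up without assigned objects.

```python
import math

def assign(cluster, centroids):
    """Group the objects of a cluster by their nearest centroid."""
    groups = [[] for _ in centroids]
    for o in cluster:
        j = min(range(len(centroids)), key=lambda c: math.dist(o, centroids[c]))
        groups[j].append(o)
    return groups

def rep_kmeans_model(cluster, scor, max_iter=100):
    """Return (centroid, eps_range) pairs for one local cluster.

    scor: specific core points of the cluster, used as k-means starting points.
    """
    centroids = [tuple(map(float, s)) for s in scor]
    for _ in range(max_iter):
        groups = assign(cluster, centroids)
        new_centroids = [
            tuple(sum(x) / len(g) for x in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    groups = assign(cluster, centroids)   # final assignment for the eps-ranges
    # eps-range of a centroid: distance to the farthest object assigned to it
    return [(c, max(math.dist(o, c) for o in g))
            for c, g in zip(centroids, groups) if g]
```

Note that the ε-range of a centroid is the plain maximum distance to its assigned objects; unlike in Section 5.1, no Eps term is added, because the centroid is required to cover only the objects assigned to it.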

6 Determination of a Global Model 

Each local model $LocalModel_k$ consists of a set of $m_k$ pairs, each consisting of a representative
r and an ε-range value $\varepsilon_r$. The number $m_k$ of pairs transmitted from each site k is determined
by the number n of clusters $C_i$ found on site k and the number $|Scor_{C_i}|$ of specific core points
for each cluster $C_i$ as follows:

$m_k = \sum_{i = 1..n} |Scor_{C_i}|$.

Fig. 4. Determination of a global model a) local clusters b) local representatives c) determination
of a global model with $Eps_{global} = 2 \cdot Eps_{local}$

Each of these pairs $(r, \varepsilon_r)$ represents several objects which are all located in $N_{\varepsilon_r}(r)$, i.e. the
$\varepsilon_r$-neighborhood of r. All objects contained in $N_{\varepsilon_r}(r)$ belong to the same cluster. To put it another
way, each specific local representative forms a cluster on its own. Obviously, we have
to check whether it is possible to merge two or more of these clusters. These merged local representatives
together with the unmerged local representatives form the global model. Thus, the
global model consists of clusters consisting of one or of several local representatives.

To find such a global model, we use the density based clustering algorithm DBSCAN 
again. We would like to create a clustering similar to the one produced by DBSCAN if ap- 
plied to the complete dataset with the local parameter settings. As we have only access to 
the set of all local representatives, the global parameter setting has to be adapted to this ag- 
gregated local information. 

As we assume that all local representatives form a cluster on their own, it is enough to use
a $MinPts_{global}$ parameter of 2. If 2 representatives, stemming from the same or different local
sites, are density-connected to each other wrt. $MinPts_{global}$ and $Eps_{global}$, then they belong
to the same global cluster.

The question of a suitable $Eps_{global}$ value is much more difficult. Obviously, $Eps_{global}$
should be greater than the Eps parameter $Eps_{local}$ used for the clustering on the local sites.
For high $Eps_{global}$ values, we run the risk of merging clusters which do not belong
together. On the other hand, if we use small $Eps_{global}$ values, we might not be able to detect
clusters belonging together. Therefore, we suggest that the $Eps_{global}$ parameter should be
tunable by the user dependent on the $\varepsilon_R$ values of all local representatives R. If these $\varepsilon_R$
values are generally high, it is advisable to use a high $Eps_{global}$ value. On the other hand, if the
$\varepsilon_R$ values are low, a small $Eps_{global}$ value is better. The default value which we propose is
equal to the maximum value of all $\varepsilon_R$ values of all local representatives R. This default
$Eps_{global}$ value is generally close to $2 \cdot Eps_{local}$ (cf. Section 9).

In Figure 4, an example for $Eps_{global} = 2 \cdot Eps_{local}$ is depicted. Figure 4a shows the
independently detected clusters on sites 1, 2 and 3. The cluster on site 1 is characterized
by two representatives R1 and R2, whereas the clusters on site 2 and site 3 are each
characterized by only one representative, as shown in Figure 4b. Figure 4c (VII) illustrates that all 4
clusters from the different sites belong to one large cluster. Figure 4c (VIII) illustrates that
an $Eps_{global}$ equal to $Eps_{local}$ is insufficient to detect this global cluster. On the other hand,
if we use an $Eps_{global}$ parameter equal to $2 \cdot Eps_{local}$, the 4 representatives are merged
into one large cluster (cf. Figure 4c (IX)).
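The merging step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `build_global_model`, the tuple layout of the representatives and the use of a union-find structure are assumptions made here. It relies on the observation that, with $MinPts_{global} = 2$, two representatives lying within $Eps_{global}$ of each other are both core points, so the global clusters are the connected components of the $Eps_{global}$-neighborhood graph; unmerged representatives keep a cluster of their own, as described above.

```python
import math


def build_global_model(representatives, eps_global=None):
    """Group local representatives into global clusters.

    representatives: list of (point, eps_r) pairs, point being a tuple of floats.
    Returns one global cluster id per representative (hypothetical layout).
    """
    if eps_global is None:
        # Default proposed in the text: the maximum eps_r of all representatives.
        eps_global = max(eps for _, eps in representatives)

    parent = list(range(len(representatives)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Link every pair of representatives that lie within eps_global of each other.
    for i, (p, _) in enumerate(representatives):
        for j in range(i + 1, len(representatives)):
            q = representatives[j][0]
            if math.dist(p, q) <= eps_global:
                parent[find(i)] = find(j)

    # Map component roots to consecutive cluster ids.
    roots = {}
    return [roots.setdefault(find(i), len(roots))
            for i in range(len(representatives))]


# Three representatives with Eps_local = 1.0, spaced 1.5 apart, plus a distant one.
reps = [((0.0, 0.0), 1.0), ((1.5, 0.0), 1.0), ((3.0, 0.0), 1.0), ((20.0, 0.0), 1.0)]
labels_small = build_global_model(reps, eps_global=1.0)  # nothing is merged
labels_large = build_global_model(reps, eps_global=2.0)  # first three are merged
```

The quadratic pair loop is only meant for the small number of transmitted representatives; a real server would presumably use an index structure.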

Instead of a user-defined $Eps_{global}$ parameter, we could also use a hierarchical density-based
clustering algorithm, e.g. OPTICS [1], for the creation of the global model. This
approach would enable the user to visually analyze the hierarchical clustering structure for
several $Eps_{global}$ parameters without running the clustering algorithm again and again. We
refrain from this approach for several reasons. First, the relabeling process discussed
in the next section would become very tedious. Second, a quantitative evaluation (cf.
Section 9) of our DBDC algorithm is almost impossible. Third, the incremental version of
DBSCAN allows us to start with the construction of the global model after the first
representatives of any local model come in. Thus we do not have to wait for all clients to have
transmitted their complete local models.

7 Updating of the Local Clustering Based on the Global Model 

After having created a global clustering, we send the complete global model to all client
sites. The client sites relabel all objects located on their site independently from each other.
On the client site, two formerly independent clusters may be merged due to this new
relabeling. Furthermore, objects which were formerly assigned to local noise may now be part of a
global cluster. If a local object $o$ is in the $\varepsilon_r$-range of a representative $r$, $o$ is assigned to the
same global cluster as $r$.

Figure 5 depicts an example for this relabeling process. The objects R1 and R2 are the
local representatives. Each of them forms a cluster on its own. Objects A and B have been
classified as noise. Representative R3 is a representative stemming from another site. As R1,
R2 and R3 belong to the same global cluster, all objects from the local clusters Cluster 1 and
Cluster 2 are assigned to this global cluster. Furthermore, the objects A and B are assigned
to this global cluster as they are within the $\varepsilon_{R3}$-neighborhood of R3, i.e. $A, B \in N_{\varepsilon_{R3}}(R3)$. On
the other hand, object C still belongs to noise as $C \notin N_{\varepsilon_{R3}}(R3)$.

These updated local client clusterings help the clients to answer server questions
efficiently, e.g. questions such as “give me all objects on your site which belong to the global
cluster 4711”.
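The relabeling rule of this section can be sketched as follows. The function name `relabel`, the `NOISE` marker and the triple layout of the global model are hypothetical choices for illustration; in particular, taking the first matching representative when an object lies in the range of several representatives is an assumption, since the text does not discuss that case.

```python
import math

NOISE = -1  # hypothetical marker for objects that stay noise


def relabel(local_objects, global_model):
    """Assign each local object to the global cluster of a representative
    whose eps_r-range contains it; objects covered by no representative
    remain noise.

    global_model: list of (representative, eps_r, global_cluster_id) triples.
    """
    labels = []
    for o in local_objects:
        label = NOISE
        for r, eps_r, cluster_id in global_model:
            if math.dist(o, r) <= eps_r:
                label = cluster_id
                break  # assumption: the first matching representative wins
        labels.append(label)
    return labels


# Two representatives of the same global cluster 7 and three local objects.
model = [((0.0, 0.0), 1.0, 7), ((5.0, 0.0), 2.0, 7)]
objects = [(0.5, 0.0), (4.0, 0.0), (9.0, 0.0)]
new_labels = relabel(objects, model)
```

With the relabeled objects stored per site, a server query for all members of a given global cluster reduces to a filter on the label.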



8 Quality of Distributed Clustering 

There exists no general quality measure which helps to evaluate the quality of a distributed
clustering. If we want to evaluate our new DBDC approach, we first have to tackle the
problem of finding a suitable quality criterion. Such a suitable quality criterion should yield
a high quality value if we compare a “good” distributed clustering to a central clustering,
i.e. reference clustering. On the other hand, it should yield a low value if we compare a
“bad” distributed clustering to a central clustering. Needless to say, if we compare a
reference clustering to itself, the quality should be 100%. Let us first formally introduce the
notion of a clustering.

Definition 8 (clustering CL). Let $D = \{x_1, ..., x_n\}$ be a database consisting of n objects.
Then, we call any set CL a clustering of D w.r.t. MinPts, if it fulfils the following properties:

• $CL \subseteq 2^D$

• $\forall C \in CL: |C| \geq MinPts$

• $\forall C_1, C_2 \in CL: C_1 \neq C_2 \Rightarrow C_1 \cap C_2 = \emptyset$
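As a small illustration, the three properties can be checked mechanically. The helper name `is_clustering` is hypothetical, and the size condition is read here as "at least MinPts objects", in line with the later remark that each cluster consists of at least MinPts elements.

```python
def is_clustering(cl, database, min_pts):
    """Check the three properties of Definition 8 for a collection of clusters."""
    clusters = {frozenset(c) for c in cl}  # CL is a set of clusters
    db = set(database)
    subsets = all(c <= db for c in clusters)               # CL is a subset of 2^D
    large_enough = all(len(c) >= min_pts for c in clusters)
    # Distinct clusters are pairwise disjoint exactly when no object is counted twice.
    disjoint = sum(len(c) for c in clusters) == len(set().union(*clusters))
    return subsets and large_enough and disjoint


valid = is_clustering([{1, 2, 3}, {4, 5, 6}], range(1, 8), min_pts=3)
overlapping = is_clustering([{1, 2, 3}, {3, 4, 5}], range(1, 8), min_pts=3)
too_small = is_clustering([{1, 2}], range(1, 8), min_pts=3)
```

Objects of D that appear in no cluster (object 7 in the first call) are the noise objects.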

In the following we denote by $CL_{distr}$ a clustering resulting from our distributed
approach and by $CL_{central}$ our central reference clustering. We will define two different
quality criteria which measure the similarity between $CL_{distr}$ and $CL_{central}$. We compare
the two introduced quality criteria to each other by discussing a small example.

Let us assume that we have n objects, distributed over k sites. Our DBDC algorithm
assigns each object x either to a cluster or to noise. We compare the result of our DBDC
algorithm to a central clustering of the n objects using DBSCAN. Then we assign to each
object x a numerical value P(x) indicating the quality for this specific object. The overall
quality of the distributed clustering is the mean of the qualities assigned to each object.

Definition 9 (distributed clustering quality $Q_{DBDC}$). Let $D = \{x_1, ..., x_n\}$ be a database
consisting of n objects. Let P be an object quality function $P: D \to [0, 1]$. Then the quality
$Q_{DBDC}$ of our distributed clustering w.r.t. P is computed as follows:

$$Q_{DBDC} = \frac{\sum_{i=1}^{n} P(x_i)}{n}$$

The crucial question is “what is a suitable object quality function?”. In the following two
subsections, we will discuss two different object quality functions P.

8.1 First Object Quality Function $P^I$

Obviously, P(x) should yield a rather high value if an object x together with many other
objects is contained in a distributed cluster $C_d$ and a central cluster $C_c$. In the case of
density-based partitioning clustering, a cluster might consist of only MinPts elements. Therefore,
the number of objects contained in two identical clusters might not be higher than MinPts.
On the other hand, each cluster consists of at least MinPts elements. Therefore, asking for
fewer than MinPts elements in both clusters would weaken the quality criterion
unnecessarily.

If x is included in a distributed cluster $C_d$ but is assigned to noise by the central
clustering, the value of P(x) should be 0. If x is not contained in any distributed cluster, i.e. it is
assigned to noise, a high object quality value requires that it is also not contained in a central
cluster. In the following, we will define a discrete object quality function $P^I$ which assigns
either 0 or 1 to an object x, i.e. $P^I(x) = 0$ or $P^I(x) = 1$.

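Definition 10 (discrete object quality $P^I$). Let $x \in D$ and let $C_d$, $C_c$ be the distributed and the
central cluster containing x, if any. Then we can define an object quality function
$P^I: D \to \{0, 1\}$ w.r.t. a quality parameter qp as follows:

$$P^I(x) = \begin{cases}
0 & \text{if } x \in Noise_{distr} \wedge x \notin Noise_{central} \\
0 & \text{if } x \notin Noise_{distr} \wedge x \in Noise_{central} \\
1 & \text{if } x \in Noise_{distr} \wedge x \in Noise_{central} \\
1 & \text{if } x \notin Noise_{distr} \wedge x \notin Noise_{central} \wedge |C_d \cap C_c| \geq qp \\
0 & \text{if } x \notin Noise_{distr} \wedge x \notin Noise_{central} \wedge |C_d \cap C_c| < qp
\end{cases}$$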
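A minimal sketch of how $P^I$ and the overall quality $Q_{DBDC}$ of Definition 9 could be computed is given below. The representation is an assumption made for illustration: each clustering is a dictionary mapping an object to its cluster label, with `None` standing for noise, and the names `p1` and `q_dbdc` are hypothetical.

```python
def cluster_of(labels, x):
    """Return the set of objects sharing x's cluster label."""
    return {o for o, label in labels.items() if label == labels[x]}


def p1(x, distr, central, qp):
    """Discrete object quality: 1 if x is clustered 'correctly', else 0."""
    d_noise = distr[x] is None
    c_noise = central[x] is None
    if d_noise and c_noise:
        return 1
    if d_noise or c_noise:
        return 0
    overlap = cluster_of(distr, x) & cluster_of(central, x)
    return 1 if len(overlap) >= qp else 0


def q_dbdc(objects, p):
    """Mean object quality over all objects (Definition 9)."""
    objects = list(objects)
    return sum(p(x) for x in objects) / len(objects)


# Toy example: object "c" is clustered by the distributed run but is central noise.
distr = {"a": 0, "b": 0, "c": 1, "d": None}
central = {"a": 0, "b": 0, "c": None, "d": None}
quality = q_dbdc(distr, lambda x: p1(x, distr, central, qp=2))
```

In this toy example three of the four objects receive quality 1, so the overall quality is 0.75.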

The main advantage of the object quality function P 1 is that it is rather simple because it 
yields only a boolean return value, i.e. it tells whether an object was clustered correctly or 
falsely. Nevertheless, sometimes a more subtle quality measure is required which does not 
only assign a binary quality value to an object. In the following section, we will introduce a 
new object quality function which is not confined to the two binary quality values 0 and 1. 
This more sophisticated quality function can compute any value in between 0 and 1 which 
much better reflects the notion of “correctly clustered”. 

8.2 Second Object Quality Function $P^{II}$

The main idea of our new quality function is to take the number of elements which were
clustered together with the object x during the distributed and the central clustering into
consideration. Furthermore, we decrease the quality of x if there are objects which have been
clustered together with x in only one of the two clusterings.

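Definition 11 (continuous object quality $P^{II}$). Let $x \in D$ and let $C_c$, $C_d$ be a central and
a distributed cluster. Then we define an object quality function $P^{II}: D \to [0, 1]$ as follows:

$$P^{II}(x) = \begin{cases}
0 & \text{if } x \in Noise_{distr} \wedge x \notin Noise_{central} \\
0 & \text{if } x \notin Noise_{distr} \wedge x \in Noise_{central} \\
1 & \text{if } x \in Noise_{distr} \wedge x \in Noise_{central} \\
\frac{|C_d \cap C_c|}{|C_d \cup C_c|} & \text{otherwise}
\end{cases}$$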
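The continuous quality function can be sketched in the same style. As before, the dictionary representation (object to cluster label, `None` for noise) and the name `p2` are assumptions for illustration only; the last case is the ratio of shared to combined members of the two clusters containing x.

```python
def p2(x, distr, central):
    """Continuous object quality in [0, 1]."""
    d_noise = distr[x] is None
    c_noise = central[x] is None
    if d_noise and c_noise:
        return 1.0
    if d_noise or c_noise:
        return 0.0
    c_d = {o for o, label in distr.items() if label == distr[x]}
    c_c = {o for o, label in central.items() if label == central[x]}
    return len(c_d & c_c) / len(c_d | c_c)


# The distributed run merges "c" into the cluster of "a" and "b";
# the central run keeps "c" in a cluster of its own.
distr = {"a": 0, "b": 0, "c": 0, "d": None}
central = {"a": 0, "b": 0, "c": 1, "d": None}
qualities = {x: p2(x, distr, central) for x in distr}
mean_quality = sum(qualities.values()) / len(qualities)
```

Here "a" and "b" obtain 2/3, "c" obtains 1/3 and the noise object "d" obtains 1, which shows how objects clustered together in only one of the two clusterings lower the value gradually instead of switching it between 0 and 1.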
Fig. 6. Test data sets used: a) test data set A, b) test data set B, c) test data set C



9 Experimental Evaluation 

We evaluated our DBDC approach based on three different 2-dimensional point sets
where we varied both the number of points and the characteristics of the point sets. Figure
6 depicts the three test data sets used, A (8700 objects, randomly generated data/cluster), B
(4000 objects, very noisy data) and C (1021 objects, 3 clusters), on the central site. In order
to evaluate our DBDC approach, we equally distributed the data set onto the different client
sites and then compared DBDC to a single run of DBSCAN on all data points. We carried
out all local clusterings sequentially. Then, we collected all representatives of all local runs,
and applied a global clustering on these representatives. For all these steps, we always used
the same computer. The overall runtime was formed by adding the time needed for the
global clustering to the maximum time needed for the local clusterings. All experiments were
performed on a Pentium III/700 machine.

In a first set of experiments, we consider efficiency aspects, whereas in the following 
sections we concentrate on quality aspects. 

9.1 Efficiency 

In Figure 7, we used test data sets with varying cardinalities to compare the
overall runtime of our DBDC algorithm to the runtime of a central clustering. Furthermore,
we compared our two local models w.r.t. efficiency to each other. Figure 7a shows that our
DBDC approach outperforms a central clustering by far for large data sets. For instance, for
a point set consisting of 100,000 points, both DBDC approaches, i.e. DBDC($REP_{Scor}$) and
DBDC($REP_{k-Means}$), outperform the central DBSCAN algorithm by more than one order of
magnitude independent of the local clustering used. Furthermore, Figure 7a shows that the
local model for $REP_{Scor}$ can be computed more efficiently than the local model for
$REP_{k-Means}$.

Figure 7b shows that for small data sets our DBDC approach is slightly slower than the
central clustering approach. Nevertheless, the additional overhead for distributed clustering
is almost negligible even for small data sets.

Figure 8 depicts how the overall runtime depends on the number of sites used.
We compared DBDC based on $REP_{Scor}$ to a central clustering with DBSCAN.
Our experiments show that we obtain a speed-up factor which is somewhere between $O(n)$
and $O(n^2)$. This high speed-up factor is due to the fact that DBSCAN has a runtime
complexity somewhere between $O(n \log n)$ and $O(n^2)$ when using a suitable index structure, e.g.
an R*-tree [3].

Fig. 7. Overall runtime for central and distributed clustering dependent on the cardinality of
the data set A: a) high number of data objects, b) small number of data objects.
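The speed-up range reported for the experiment on varying numbers of sites can be made plausible with a back-of-the-envelope calculation. The sketch below is an idealized cost model, not a measurement: it assumes that the sites cluster equally sized shares of the data in parallel, that the time for the global clustering is negligible, and that the speed-up bounds in the text refer to the number of sites. The function name `speedup_bounds` is hypothetical.

```python
import math


def speedup_bounds(n_objects, sites):
    """Idealized speed-up of local clustering on `sites` equal shares
    over one central DBSCAN run, for the two cost models in the text."""
    local_n = n_objects / sites
    # Cost model O(n log n): speed-up is slightly above the number of sites.
    lower = (n_objects * math.log(n_objects)) / (local_n * math.log(local_n))
    # Cost model O(n^2): speed-up is the square of the number of sites.
    upper = n_objects ** 2 / local_n ** 2
    return lower, upper


lower, upper = speedup_bounds(203_000, sites=4)
```

For 203,000 points on 4 sites this model gives a speed-up between roughly 4.5 and 16; any real measurement additionally includes the global clustering step.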



9.2 Quality 

In the next set of experiments we evaluated the quality of our two introduced object
quality functions $P^I$ and $P^{II}$ together with the quality of our DBDC approach. Figure 9a shows
that the quality according to $P^I$ of both local models is very high and does not change if we
vary the $Eps_{global}$ parameter during global clustering. On the other hand, if we look at Figure
9b, we can clearly see that for $Eps_{global}$ parameters equal to $2 \cdot Eps_{local}$, we get the best
quality for both local models. This is equal to the default value for the server site clustering
which we derived in Section 6. Furthermore, the quality worsens for very high and very
small $Eps_{global}$ parameters, which is in accordance with the quality which an experienced user
would assign to those clusterings.

To sum up, these experiments yield two basic insights:

• The object quality function $P^{II}$ is more suitable than $P^I$.

• A good $Eps_{global}$ parameter is around $2 \cdot Eps_{local}$.

Furthermore, the experiments indicate that the local model $REP_{k-Means}$ yields slightly
higher quality.

For the following experiments, we used an $Eps_{global}$ parameter of $2 \cdot Eps_{local}$.

Fig. 8. Overall runtime for central and distributed clustering DBDC($REP_{Scor}$) for a data set of
203,000 points: a) dependent on the number of sites, b) speed-up of DBDC compared to central clustering

Fig. 9. Evaluation of object quality functions for varying $Eps_{global}$ parameters for data set A on
4 local sites: a) object quality function $P^I$, b) object quality function $P^{II}$

Figure 10 shows how the quality of our DBDC approach depends on the number of
client sites. We can see that the quality according to $P^I$ is independent of the number of
client sites, which indicates again that this quality measure is unsuitable. On the other hand, the
quality computed by $P^{II}$ is in accordance with the intuitive quality which an experienced
user would assign to the distributed clusterings on the varying number of sites. Although
the quality decreases slightly for an increasing number of sites, the overall quality for
both local models $REP_{k-Means}$ and $REP_{Scor}$ is very high.

Figure 11 shows that for the three different data sets A, B and C our DBDC approach
yields good results for both local models. The more accurate quality measure $P^{II}$ indicates
that the DBDC approach based on $REP_{k-Means}$ yields a quality which reflects more
adequately the user’s expectations. This is especially true for the rather noisy data set B, where
$P^{II}$ yields the lower quality corresponding to the user’s intuition.

Fig. 11. Quality for data sets A, B and C

To sum up, our new DBDC approach based on $REP_{k-Means}$ efficiently yields a very high
quality even for a rather high number of local sites and data sets of various cardinalities and
characteristics.



10 Conclusions 

In this paper, we first motivated the need for distributed clustering algorithms. Due to
technical, economical or security reasons, it is often not possible to transmit all data from
different local sites to one central server site and then cluster the data there. Therefore, we
have to apply an efficient and effective distributed clustering algorithm from which many
application areas will benefit. We developed a partitioning distributed clustering algorithm
which is based on the density-based clustering algorithm DBSCAN. We clustered the data
locally and independently from each other and transmitted only aggregated information
about the local data to a central server. This aggregated information consists of a set of pairs,
comprising a representative $r$ and an $\varepsilon$-range value $\varepsilon_r$ indicating the validity area of the
representative. Based on these local models, we reconstruct a global clustering. This global
clustering was carried out by means of standard DBSCAN where the two input parameters
$Eps_{global}$ and $MinPts_{global}$ were chosen such that the information contained in the local
models is processed in the best possible way. The created global model is sent to all clients,
which use this information to relabel their own objects.

As there exists no general quality measure which helps to evaluate the quality of a
distributed clustering, we introduced suitable quality criteria of our own. In the experimental
evaluation, we discussed the suitability of our quality criteria and our density-based
distributed clustering approach. Based on the quality criteria, we showed that our new distributed
clustering approach yields almost the same clustering quality as a central clustering on all
data. On the other hand, we showed that we have an enormous efficiency advantage
compared to a central clustering carried out on all data.







References 

1. Ankerst M., Breunig M. M., Kriegel H.-P., Sander J.: "OPTICS: Ordering Points To Identify the
Clustering Structure", Proc. ACM SIGMOD, Philadelphia, PA, 1999, pp. 49-60.

2. Agrawal R., Shafer J. C.: "Parallel mining of association rules: Design, implementation, and
experience", IEEE Trans. Knowledge and Data Eng. 8 (1996) 962-969

3. Beckmann N., Kriegel H.-P., Schneider R., Seeger B.: "The R*-tree: An Efficient and Robust
Access Method for Points and Rectangles", Proc. ACM SIGMOD Int. Conf. on Management of Data
(SIGMOD'90), Atlantic City, NJ, ACM Press, New York, 1990, pp. 322-331.

4. Ciaccia P., Patella M., Zezula P.: "M-tree: An Efficient Access Method for Similarity Search in
Metric Spaces", Proc. 23rd Int. VLDB, Athens, Greece, 1997, pp. 426-435.

5. Dhillon I. S., Modha D. S.: "A Data-Clustering Algorithm On Distributed Memory
Multiprocessors", SIGKDD 99

6. Ester M., Kriegel H.-P., Sander J., Wimmer M., Xu X.: "Incremental Clustering for Mining in a
Data Warehousing Environment", VLDB 98

7. Ester M., Kriegel H.-P., Sander J., Xu X.: "A Density-Based Algorithm for Discovering Clusters
in Large Spatial Databases with Noise", Proc. 2nd Int. Conf. on Knowledge Discovery and Data
Mining (KDD'96), Portland, OR, AAAI Press, 1996, pp. 226-231.

8. Ertoz L., Steinbach M., Kumar V.: "Finding Clusters of Different Sizes, Shapes, and Densities in
Noisy, High Dimensional Data", SIAM International Conference on Data Mining (2003)

9. Forman G., Zhang B.: "Distributed Data Clustering Can Be Efficient and Exact", SIGKDD
Explorations 2(2): 34-38 (2000)

10. Hanisch R. J.: "Distributed Data Systems and Services for Astronomy and the Space Sciences", in
ASP Conf. Ser., Vol. 216, Astronomical Data Analysis Software and Systems IX, eds. N. Manset,
C. Veillet, D. Crabtree (San Francisco: ASP) 2000

11. Hartigan J. A.: "Clustering Algorithms", Wiley, 1975

12. Han E. H., Karypis G., Kumar V.: "Scalable parallel data mining for association rules", In:
SIGMOD Record: Proceedings of the 1997 ACM-SIGMOD Conference on Management of Data,
Tucson, AZ, USA. (1997) 277-288

13. Jain A. K., Dubes R. C.: "Algorithms for Clustering Data", Prentice-Hall Inc., 1988.

14. Johnson E., Kargupta H.: "Hierarchical Clustering From Distributed, Heterogeneous Data", In
Zaki M. and Ho C., editors, Large-Scale Parallel KDD Systems. Lecture Notes in Computer
Science, volume 1759, 221-244. Springer-Verlag, 1999

15. Jain A. K., Murty M. N., Flynn P. J.: "Data Clustering: A Review", ACM Computing Surveys, Vol.
31, No. 3, Sep. 1999, pp. 265-323.

16. Kargupta H., Chan P. (editors): "Advances in Distributed and Parallel Knowledge Discovery",
AAAI/MIT Press, 2000

17. Shafer J., Agrawal R., Mehta M.: "A scalable parallel classifier for data mining", In: Proc. 22nd
International Conference on VLDB. Mumbai, India. (1996)

18. Srivastava A., Han E. H., Kumar V., Singh V.: "Parallel formulations of decision-tree
classification algorithms", In: Proc. 1998 International Conference on Parallel Processing. (1998)

19. Samatova N. F., Ostrouchov G., Geist A., Melechko A. V.: "RACHET: An Efficient Cover-Based
Merging of Clustering Hierarchies from Distributed Datasets", Distributed and Parallel Databases,
11(2): 157-180; Mar 2002

20. Sayal M., Scheuermann P.: "A Distributed Clustering Algorithm for Web-Based Access Patterns",
in Proceedings of the 2nd ACM-SIGMOD Workshop on Distributed and Parallel Knowledge
Discovery, Boston, August 2000

21. Xu X., Jäger J., Kriegel H.-P.: "A Fast Parallel Clustering Algorithm for Large Spatial Databases",
Data Mining and Knowledge Discovery, 3, 263-290 (1999), Kluwer Academic Publisher

22. Zaki M. J., Parthasarathy S., Ogihara M., Li W.: "New parallel algorithms for fast discovery of
association rules", Data Mining and Knowledge Discovery, 1, 343-373 (1997)




Iterative Incremental Clustering of Time Series 



Jessica Lin, Michail Vlachos, Eamonn Keogh, and Dimitrios Gunopulos 



Computer Science & Engineering Department 
University of California, Riverside 
Riverside, CA 92521 

{jessica, mvlachos, eamonn, dg}@cs.ucr.edu



Abstract. We present a novel anytime version of partitional clustering algorithms,
such as k-Means and EM, for time series. The algorithm works by leveraging
off the multi-resolution property of wavelets. The dilemma of choosing
the initial centers is mitigated by initializing the centers at each approximation
level, using the final centers returned by the coarser representations. In addition
to casting the clustering algorithms as anytime algorithms, this approach has
two other very desirable properties. By working at lower dimensionalities we
can efficiently avoid local minima. Therefore, the quality of the clustering is
usually better than the batch algorithm. In addition, even if the algorithm is run
to completion, our approach is much faster than its batch counterpart. We
explain, and empirically demonstrate, these surprising and desirable properties
with comprehensive experiments on several publicly available real data sets.
We further demonstrate that our approach can be generalized to a framework for a
much broader range of algorithms or data mining problems.



1 Introduction 

Clustering is a vital process for condensing and summarizing information, since it can 
provide a synopsis of the stored data. Although there has been much research on 
clustering in general, most classic machine learning and data mining algorithms do 
not work well for time series due to their unique structure. In particular, the high 
dimensionality, very high feature correlation, and the (typically) large amount of 
noise that characterize time series data present a difficult challenge. Although 
numerous clustering algorithms have been proposed, the majority of them work in a 
batch fashion, thus hindering interaction with the end users. Here we address the 
clustering problem by introducing a novel anytime version of partitional clustering
algorithms based on wavelets. Anytime algorithms are valuable for large databases,
since results are produced progressively and are refined over time [11]. Their utility
for data mining has been documented at length elsewhere [2, 21]. While partitional 
clustering algorithms and wavelet decomposition have both been studied extensively 
in the past, the major novelty of our approach is that it mitigates the problem 
associated with the choice of initial centers, in addition to providing the functionality 
of user-interaction. 

The algorithm works by leveraging off the multi-resolution property of wavelet
decomposition [1, 6, 22]. In particular, an initial clustering is performed with a very
coarse representation of the data. The results obtained from this “quick and dirty”
clustering are used to initialize a clustering at a finer level of approximation. This
process is repeated until the “approximation” is the original “raw” data. Our approach
allows the user to interrupt and terminate the process at any level. In addition to
casting the clustering algorithm as an anytime algorithm, our approach has two other
very unintuitive properties. The quality of the clustering is often better than the batch
algorithm, and even if the algorithm is run to completion, the time taken is typically
much less than the time taken by the batch algorithm.
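The coarse-to-fine scheme described above can be sketched as a Python generator that hands back a result after every resolution level, so the caller may stop at any point. This is only an illustration under stated assumptions, not the authors' implementation: the names `coarsen`, `k_means` and `anytime_k_means` are hypothetical, block averages stand in for the smoothed wavelet approximations, and the centers of one level are carried to the next by repeating each value twice.

```python
import math


def coarsen(series, dims):
    """Average consecutive blocks so that the series has `dims` values."""
    block = len(series) // dims
    return tuple(sum(series[i * block:(i + 1) * block]) / block
                 for i in range(dims))


def k_means(points, centers):
    """Plain k-Means started from the given centers (modified in place)."""
    k = len(centers)
    assignment = None
    while True:
        new = [min(range(k), key=lambda m: math.dist(p, centers[m]))
               for p in points]
        if new == assignment:
            return centers, assignment
        assignment = new
        for m in range(k):
            cluster = [p for p, a in zip(points, assignment) if a == m]
            if cluster:  # keep the old center if a cluster runs empty
                centers[m] = tuple(sum(c) / len(cluster) for c in zip(*cluster))


def anytime_k_means(series_list, initial_centers):
    """Yield (dimensionality, assignment) for each level, coarse to fine.

    Assumes the series length equals the initial dimensionality times a
    power of two.
    """
    length = len(series_list[0])
    dims = len(initial_centers[0])
    centers = list(initial_centers)
    while True:
        points = [coarsen(s, dims) for s in series_list]
        centers, assignment = k_means(points, centers)
        yield dims, assignment
        if dims == length:
            return
        dims *= 2
        # Assumption: project each center to the finer level by value repetition.
        centers = [tuple(v for v in c for _ in range(2)) for c in centers]


series = [(0, 0, 0, 0), (0, 1, 0, 1), (10, 10, 10, 10), (10, 11, 10, 11)]
levels = list(anytime_k_means(series, [(0.0, 0.0), (10.0, 10.0)]))
```

A caller that is satisfied with a coarse answer simply stops iterating over the generator instead of collecting all levels.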

We initially focus our approach on the popular k-Means clustering algorithm [10, 
18, 24] for time series. For simplicity we demonstrate how the algorithm works by 
utilizing the Haar wavelet decomposition. Then we extend the idea to another widely 
used clustering algorithm, EM, and another well-known decomposition method, DFT, 
towards the end of the paper. We demonstrate that our algorithm can be generalized 
as a framework for a much broader range of algorithms or data mining problems. 

The rest of this paper is organized as follows. In Section 2 we review related work,
and introduce the necessary background on the wavelet transform and k-Means
clustering. In Section 3, we introduce our algorithm. Section 4 contains a
comprehensive comparison of our algorithm to classic k-Means on real datasets. In
Section 5 we study how our approach can be extended to other iterative refinement
methods (such as EM), and we also investigate the use of other multi-resolution
decompositions such as DFT. In Section 6 we summarize our findings and offer
suggestions for future work.



2 Background and Related Work 

Since our work draws on the confluence of clustering, wavelets and anytime 
algorithms, we provide the necessary background on these areas in this section. 



2.1 Background on Clustering 

One of the most widely used clustering approaches is hierarchical clustering, due to 
the great visualization power it offers [12]. Hierarchical clustering produces a nested 
hierarchy of similar groups of objects, according to a pairwise distance matrix of the 
objects. One of the advantages of this method is its generality, since the user does not 
need to provide any parameters such as the number of clusters. However, its 
application is limited to only small datasets, due to its quadratic (or higher order) 
computational complexity. 

A faster method to perform clustering is k-Means [2, 18]. The basic intuition 
behind k-Means (and in general, iterative refinement algorithms) is the continuous 
reassignment of objects into different clusters, so that the within-cluster distance is 
minimized. Therefore, if x are the objects and c are the cluster centers, k-Means 
attempts to minimize the following objective function: 

$$F = \sum_{m=1}^{k} \sum_{i=1}^{N} |x_i - c_m| \qquad (1)$$

The k-Means algorithm for N objects has a complexity of O(kNrD) [18], where k is
the number of clusters specified by the user, r is the number of iterations until
convergence, and D is the dimensionality of the points. The shortcomings of the
algorithm are its tendency to favor spherical clusters, and its requirement for prior
knowledge on the number of clusters, k. The latter limitation can be mitigated by
attempting all values of k within a large range. Various statistical tests can then be
used to determine which value of k is most parsimonious. However, this approach
only worsens k-Means’ already considerable time complexity. Since k-Means is
essentially a hill-climbing algorithm, it is guaranteed to converge on a local but not
necessarily global optimum. In other words, the choices of the initial centers are
critical to the quality of results. Nevertheless, in spite of these undesirable properties,
for clustering large datasets of time-series, k-Means is preferable due to its faster running
time.



Table 1. An outline of the k-Means algorithm

Algorithm k-Means
1   Decide on a value for k.
2   Initialize the k cluster centers (randomly, if necessary).
3   Decide the class memberships of the N objects by assigning them to the nearest cluster center.
4   Re-estimate the k cluster centers, by assuming the memberships found above are correct.
5   If none of the N objects changed membership in the last iteration, exit. Otherwise goto 3.
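The outline in Table 1 translates almost line by line into code. The following is a minimal sketch using Euclidean distance; the function name `k_means`, the optional `initial_centers` argument and the handling of empty clusters are assumptions added for illustration.

```python
import math
import random


def k_means(points, k, initial_centers=None, seed=0):
    """Cluster `points` (tuples of numbers) into k groups.

    Returns (centers, assignment), where assignment[i] is the index of the
    center nearest to points[i].
    """
    # Steps 1 and 2: k is given; pick k distinct objects as initial centers
    # unless the caller supplies them.
    if initial_centers is not None:
        centers = list(initial_centers)
    else:
        centers = random.Random(seed).sample(points, k)

    assignment = None
    while True:
        # Step 3: assign every object to its nearest cluster center.
        new_assignment = [
            min(range(k), key=lambda m: math.dist(p, centers[m]))
            for p in points
        ]
        # Step 5: stop as soon as no object changed membership.
        if new_assignment == assignment:
            return centers, assignment
        assignment = new_assignment
        # Step 4: re-estimate each center as the mean of its members.
        for m in range(k):
            cluster = [p for p, a in zip(points, assignment) if a == m]
            if cluster:  # assumption: an empty cluster keeps its old center
                centers[m] = tuple(sum(c) / len(cluster) for c in zip(*cluster))


data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, assignment = k_means(data, k=2, initial_centers=[(0, 0), (10, 10)])
```

Passing the initial centers explicitly makes the dependence of the result on the initialization easy to experiment with.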



In order to scale the various clustering methods to massive datasets, one can either 
reduce the number of objects, N , by sampling [2], or reduce the dimensionality of the 
objects [1, 3, 9, 12, 13, 16, 19, 25, 26]. For time-series, the objective is to find a 
representation at a lower dimensionality that preserves the original information and 
describes the original shape of the time-series data as closely as possible. Many 
approaches have been suggested in the literature, including the Discrete Fourier 
Transform (DFT) [1, 9], Singular Value Decomposition [16], Adaptive Piecewise 
Constant Approximation [13], Piecewise Aggregate Approximation (PAA) [4, 26], 
Piecewise Linear Approximation [12] and the Discrete Wavelet Transform (DWT) [3, 
19]. While all these approaches have shared the ability to produce a high quality 
reduced-dimensionality approximation of time series, wavelets are unique in that their 
representation of data is intrinsically multi-resolution. This property is critical to our 
proposed algorithm and will be discussed in detail in the next section. 



2.2 Background on Wavelets 

Wavelets are mathematical functions that represent data or other functions in terms of 
the averages and differences of a prototype function, called the analyzing or mother 
wavelet [6].

In this sense, they are similar to the Fourier transform. One fundamental difference 
is that wavelets are localized in time. In other words, some of the wavelet coefficients 
represent small, local subsections of the data being studied, as opposed to Fourier 
coefficients, which always represent global contributions to the data. This property is 
very useful for multi-resolution analysis of data. The first few coefficients contain an
overall, coarse approximation of the data; additional coefficients can be perceived as
"zooming-in" to areas of high detail. Figs. 1 and 2 illustrate this idea.

The Haar Wavelet decomposition is achieved by averaging two adjacent values on
the time series function at a given resolution to form a smoothed, lower-dimensional
signal, and the resulting coefficients at this given resolution are simply the differences
between the values and their averages [3]. As a result, the Haar wavelet
decomposition is the combination of the coefficients at all resolutions, with the
overall average for the time series being its first coefficient. The coefficients are
crucial for reconstructing the original sequence, as they store the detailed information
lost in the smoothed signal.
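The averaging-and-differencing procedure can be written down directly. The sketch below uses the unnormalized form described in the paragraph above and assumes a series whose length is a power of two; the function name `haar_decompose` is hypothetical.

```python
def haar_decompose(series):
    """Return the Haar coefficients of `series`: the overall average first,
    followed by the difference coefficients from coarse to fine."""
    n = len(series)
    if n == 0 or n & (n - 1):
        raise ValueError("series length must be a power of two")
    smooth = list(series)
    details = []
    while len(smooth) > 1:
        pairs = list(zip(smooth[0::2], smooth[1::2]))
        averages = [(a + b) / 2 for a, b in pairs]
        # Difference between the first value of each pair and the pair average.
        differences = [(a - b) / 2 for a, b in pairs]
        details = differences + details  # coarser coefficients go in front
        smooth = averages
    return smooth + details


coefficients = haar_decompose([9, 7, 3, 5])
```

For the series (9, 7, 3, 5) the pairwise averages are (8, 4) with differences (1, -1); averaging again gives 6 with difference 2, so the coefficients are 6, 2, 1, -1. Keeping only the first one or two coefficients yields the coarse approximations used for multi-resolution analysis.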




Fig. 1. The Haar Wavelet representation can 
be visualized as an attempt to approximate a 
time series with a linear combination of basis 
functions. In this case, time series A is 
transformed to B by Haar wavelet 
decomposition, and the dimensionality is 
reduced from 512 to 8. 




Fig. 2. The Haar Wavelet can represent data
at different levels of resolution. Above we
see a raw time series, with increasingly
faithful wavelet approximations below.



2.3 Background on Anytime Algorithms 

Anytime algorithms are algorithms that trade execution time for quality of results 
[11]. In particular, an anytime algorithm always has a best-so-far answer available, 
and the quality of the answer improves with execution time. The user may examine 
this answer at any time, and choose to terminate the algorithm, temporarily suspend 
the algorithm, or allow the algorithm to run to completion. 

The utility of anytime algorithms for data mining has been extensively documented 
[2, 21]. Suppose a batch version of an algorithm takes a week to run (not an 
implausible scenario in mining massive, disk-resident data sets). It would be highly 
desirable to implement the algorithm as an anytime algorithm. This would allow a 
user to examine the best current answer after an hour or so as a “sanity check” of all 
assumptions and parameters. As a simple example, suppose the user had accidentally 
set the value of k to 50 instead of the desired value of 5. Using a batch algorithm the 
mistake would not be noted for a week, whereas using an anytime algorithm the 
mistake could be noted early on and the algorithm restarted with little cost. This 
motivating example could have been eliminated by user diligence! More generally, 
however, data mining algorithms do require the user to make choices of several 
parameters, and an anytime implementation of k-Means would allow the user to 
interact with the entire data mining process in a more efficient way. 
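
The defining property of an anytime algorithm, a best-so-far answer that is always available and never gets worse, can be sketched with a Python generator. This is a toy illustration under our own assumptions rather than anything from the paper: `anytime_minimize` and `run_with_budget` are hypothetical names, and the "algorithm" is a simple search over candidate values.

```python
import time


def anytime_minimize(cost, candidates):
    """Yield the best-so-far (candidate, cost) pair after each candidate is examined."""
    best, best_cost = None, float("inf")
    for candidate in candidates:
        value = cost(candidate)
        if value < best_cost:
            best, best_cost = candidate, value
        yield best, best_cost


def run_with_budget(anytime_steps, seconds):
    """Consume an anytime computation until it finishes or the time budget is spent."""
    deadline = time.monotonic() + seconds
    answer = None
    for answer in anytime_steps:
        if time.monotonic() >= deadline:
            break  # interrupted: return whatever answer is currently best
    return answer
```

A caller can inspect each yielded answer (the "sanity check" after an hour in the example above), stop consuming the generator to terminate, or simply let it run to completion.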

2.4 Related Work 

Bradley et al. [2] suggest a generic technique for scaling the k-Means clustering 
algorithm to large databases by attempting to identify regions of the data that are 
compressible, regions that must be retained in main memory, and regions that may be 
discarded. However, the generality of the method contrasts with our algorithm’s 
explicit exploitation of the structure of the data type of interest. 

Our work is more similar in spirit to the dynamic time warping similarity search 
technique introduced by Chu et al. [4]. The authors speed up linear search by 
examining the time series at increasingly finer levels of approximation. 



3 Our Approach - The I-kMeans Algorithm 

As noted in Section 2.1, the complexity of the k-Means algorithm is O(kNrD), where 
D is the dimensionality of data points (or the length of a sequence, as in the case of 
time-series). For a dataset consisting of long time-series, the D factor can burden the 
clustering task significantly. This overhead can be alleviated by reducing the data 
dimensionality. 

Another major drawback of the k-Means algorithm is that the clustering quality is 
greatly dependent on the choice of initial centers (i.e., line 2 of Table 1). As 
mentioned earlier, the k-Means algorithm guarantees local, but not necessarily global 
optimization. Poor choices of the initial centers, therefore, can degrade the quality of 
the clustering solution and result in longer execution time (see [10] for an excellent 
discussion of this issue). Our algorithm addresses these two problems of k-Means, in 
addition to offering the capability of an anytime algorithm, which allows the user to 
interrupt and terminate the program at any stage. 

We propose using a wavelet decomposition to perform clustering at increasingly 
finer levels of the decomposition, while displaying the gradually refined clustering 
results periodically to the user. Note that any wavelet basis (or any other multi- 
resolution decomposition such as DFT) can be used, as will be demonstrated in 
Section 5. We opt for the Haar Wavelet here for its simplicity and its wide use in the 
time series community. 

We compute the Haar Wavelet decomposition for all time-series data in the 
database. The complexity of this transformation is linear in the dimensionality of each 
object; therefore, the running time is reasonable even for large databases. The process 
of decomposition can be performed off-line, and needs to be done only once. The 
time series data can be stored in the Haar decomposition format, which takes the same 
amount of space as the original sequence. One important property of the 
decomposition is that it is a lossless transformation, since the original sequence can 
always be reconstructed from the decomposition. 

Once we compute the Haar decomposition, we perform the k-Means clustering 
algorithm, starting at the second level (each object at level i has 2^(i-1) dimensions) and 
gradually progressing to finer levels. Since the Haar decomposition is completely 
reversible, we can reconstruct the approximation data from the coefficients at any 
level and perform clustering on these data. We call the new clustering algorithm 
I-kMeans, where I stands for “interactive.” Fig 3 illustrates this idea. 

[Fig. 3 diagram: the Haar coefficients of Time Series 1, Time Series 2, ..., Time Series k, arranged by level up to level n, with the points used for k-Means at level 2 and at level 3 marked.] 



Fig. 3. k-Means is performed on each level on the reconstructed data from the Haar wavelet 
decomposition, starting with the second level 
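
A hedged sketch of the data that k-Means sees at a given level: assuming the unnormalized (averaging) form of the Haar transform, reconstructing a series from only its first 2^(i-1) coefficients at that resolution yields the means of 2^(i-1) equal-width segments, so the level-i view can be computed directly as segment means. The function name `level_representation` is hypothetical, and the series length is assumed to be a power of two.

```python
def level_representation(series, level):
    """Approximate a power-of-two-length series with 2**(level - 1) values.

    Each value is the mean of one equal-width segment of the original series.
    """
    dims = 2 ** (level - 1)
    if dims > len(series) or len(series) % dims:
        raise ValueError("level is too fine for this series length")
    width = len(series) // dims
    return [sum(series[i:i + width]) / width
            for i in range(0, len(series), width)]
```

For the series (8, 4, 1, 3), level 1 is the overall average (4), level 2 is (6, 2), and level 3 is the series itself.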

The intuition behind this algorithm originates from the observation that the general 
shape of a time series sequence can often be approximately captured at a lower 
resolution. As shown in Fig 2, the shape of the time series is well preserved, even at 
very coarse approximations. Because of this desirable feature of wavelets, clustering 
results typically stabilize at a low resolution, thus saving time by eliminating the need 
to run at full resolution (the raw data). The pseudo-code of the algorithm is provided 
in Table 2. 

The algorithm achieves the speed-up by doing the vast majority of reassignments 
(Line 3 in Table 1) at the lower resolutions, where the costs of distance calculations 
are considerably lower. As we gradually progress to finer resolutions, we already start 
with good initial centers (the choices of initial centers will be discussed later in this 
section). Therefore, the number of iterations r until convergence will typically be 
much lower. 



Table 2. An outline of the I-kMeans algorithm 

Algorithm I-kMeans 
1   Decide on a value for k. 
2   Initialize the k cluster centers (randomly, if necessary). 
3   Run the k-Means algorithm on the level i representation of the data. 
4   Use final centers from level i as initial centers for level i+1. 
    This is achieved by projecting the k centers returned by the 
    k-Means algorithm for the 2^i space into the 2^(i+1) space. 
5   If none of the N objects changed membership in the last 
    iteration, exit. Otherwise goto 3. 
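
The outline in Table 2 can be sketched in plain Python as below. This is an illustrative reading, not the authors' code: all function names are hypothetical, segment means stand in for the level-i Haar reconstruction (which they match only under the unnormalized averaging transform), series lengths are assumed to be a power of two, and step 5 is interpreted as "stop once the memberships are unchanged from the previous level".

```python
import random


def segment_means(series, dims):
    """Level view of a series: the mean of each of `dims` equal-width segments."""
    width = len(series) // dims
    return [sum(series[i:i + width]) / width
            for i in range(0, len(series), width)]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, centers, max_iter=100):
    """Plain k-Means; returns (centers, labels) once no object changes cluster."""
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels


def i_kmeans(data, k, seed=0):
    """Yield (dims, labels) level by level, coarse to fine, until labels are stable."""
    rng = random.Random(seed)
    length = len(data[0])  # assumed to be a power of two
    dims = 2               # the second level has two dimensions
    start = [segment_means(s, dims) for s in data]
    centers = [list(p) for p in rng.sample(start, k)]  # random initial centers
    previous = None
    while dims <= length:
        points = [segment_means(s, dims) for s in data]
        centers, labels = kmeans(points, centers)
        yield dims, labels  # the user may inspect or stop here
        if labels == previous:
            return
        previous = labels
        # Project each center into the next level by repeating every coordinate.
        centers = [[v for v in c for _ in range(2)] for c in centers]
        dims *= 2
```

Because `i_kmeans` is a generator, the caller receives the clustering after every level and can stop iterating at any point, which mirrors the interactive use described in the text.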



The I-kMeans algorithm allows the user to monitor the quality of clustering results 
as the program executes. The user can interrupt the program at any level, or wait until 
the execution terminates once the clustering results stabilize. One surprising and 
highly desirable finding from the experimental results is that even if the program is 
run to completion (until the last level, with full resolution), the total execution time is 
generally less than that of clustering on raw data. 

As mentioned earlier, on every level except for the starting level (i.e., level 2), 
which uses random initial centers, the initial centers are selected based on the final 
centers from the previous level. More specifically, the final centers computed at the 
end of level i will be used as the initial centers on level i+1. Since the length of the 
data reconstructed from the Haar decomposition doubles as we progress to the next 
level, we project the centers computed at the end of level i onto level i+1 by repeating 
each coordinate of the centers twice. This way, they match the dimensionality of the points 
on level i+1. For example, if one of the re-computed centers at the end of level 2 is 
(0.5, 1.2), then the initial center used for this cluster on level 3 is (0.5, 0.5, 1.2, 1.2). 
This approach resolves the dilemma associated with the choice of initial centers, 
which is crucial to the quality of clustering results [10]. It also contributes to the fact 
that our algorithm often produces better clustering results than the k-Means algorithm. 
More specifically, although our approach also uses random centers as initial centers, 
it is less likely to be trapped in local minima at such a low dimensionality. In addition, 
the results obtained from this initial level will be refined in subsequent levels. Note 
that while the performance of k-Means can be improved by providing “good” initial 
centers, the same argument applies to our approach as well (see footnote 1). 
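
The projection step from the example above can be written in a couple of lines. The function name `project_center` is hypothetical; the behavior follows the worked example in the text.

```python
def project_center(center):
    """Map a level-i center to level i+1 by repeating each coordinate twice."""
    return [value for value in center for _ in range(2)]
```

Applied to the center (0.5, 1.2) from level 2, this yields (0.5, 0.5, 1.2, 1.2) for level 3. The projected center is the finer-level vector whose pairwise averages equal the coarse center, so it is a natural starting point before the finer detail is taken into account.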

The algorithm can be further sped up by not reconstructing the time series. Rather, 
clustering directly on the wavelet coefficients will produce identical results. However, 
the projection technique for the final centers mentioned above would not be 
appropriate here. Instead, we can still reuse the final centers by simply padding the 
additional dimensions for subsequent levels with zeros. For brevity, we defer further 
discussion on this version to future work. 



4 Experimental Evaluation 

To show that our approach is superior to the k-Means algorithm for clustering time 
series, we performed a series of experiments on publicly available real datasets. For 
completeness, we ran the I-kMeans algorithm for all levels of approximation, and 
recorded the cumulative execution time and clustering accuracy at each level. In 
reality, however, the algorithm stabilizes in early stages and can automatically 
terminate much sooner. We compare the results with those of k-Means on the original 
data. Since both algorithms start with random initial centers, we execute each 
algorithm 100 times with different centers. However, for consistency we ensure that 
for each execution, both algorithms are seeded with the same set of initial centers. 
After each execution, we compute the error (more details will be provided in Section 
4.2) and the execution time on the clustering results. We compute and report the 
averages at the end of each experiment. By taking the average, we achieve better 
objectiveness than taking the best (minimum), since in reality, it’s unlikely that we 
would have the knowledge of the correct clustering results, or the “oracle,” to 
compare with (as was the case with one of our test datasets). 



1 As a matter of fact, our experiments (results not shown) with good initial centers show that 
this is true - while the performance of k-Means improves with good initial centers, the 
improvements on I-kMeans, in terms of both speed and accuracy, are even more drastic. 

4.1 Datasets and Methodology 

We tested on two publicly available, real datasets. The dataset cardinalities range 
from 1,000 to 8,000. The length of each time series has been set to 512 on one dataset, 
and 1024 on the other. 

• JPL: This dataset consists of readings from various inertial sensors from Space 
Shuttle mission STS-57. The data is particularly appropriate for our experiments since 
the use of redundant backup sensors means that some of the data is very highly 
correlated. In addition, even sensors that measure orthogonal features (i.e., the X and 
Y axes) may become temporarily correlated during a particular maneuver; for 
example, a “roll reversal” [8]. Thus, the data has an interesting mixture of dense and 
sparse clusters. To generate data of increasingly larger cardinality, we extracted time 
series of length 512, at random starting points of each sequence from the original data 
pool. 

• Heterogeneous: This dataset is generated from a mixture of 10 real time series 
from the UCR Time Series Data Mining Archive [14] (see Fig 4). Using the 10 
time series as seeds, we produced variations of the original patterns by adding small 
time shifts (2-3% of the series length) and interpolated Gaussian noise. Gaussian 
noisy peaks are interpolated using splines to create smooth random variations. Fig 5 
illustrates how the data is generated. 




Fig. 4. Real time series data from UCR Time 
Series Data Mining Archive. We use these 
time series as seeds to create our 
Heterogeneous dataset. 



Fig. 5. Generation of variations on the 
heterogeneous data. We produced variations 
of the original patterns by adding small time 
shifts (2-3% of the series length) and 
interpolated Gaussian noise. Gaussian noisy 
peaks are interpolated using splines to 
create smooth random variations. 



In the Heterogeneous dataset, we know that the number of clusters is 10. However, 
for the JPL dataset, we lack this information. Finding k is an open problem for the 
k-Means algorithm and is out of the scope of this paper. To determine the optimal k for 
k-Means, we attempt different values of k, ranging from 2 to 8. Nonetheless, our 
algorithm out-performs the k-Means algorithm regardless of k. In this paper we only 
show the results with k equal to 5. Fig 6 shows that our algorithm produces the same 
results as the hierarchical clustering algorithm, which is generally more costly. 

Fig. 6. [...] produces intuitive results. On the right-hand side, we show that hierarchical clustering (using 
average linkage) discovers the exact same clusters. However, hierarchical clustering is more 
costly than our algorithm. 



4.2 Error of Clustering Results 

In this section we compare the clustering quality for the I-kMeans and the classic k- 
Means algorithm. 

Since we generated the heterogeneous datasets from a set of given time series data, 
we have the knowledge of correct clustering results in advance. In this case, we can 
simply compute the clustering error by summing up the number of incorrectly 
classified objects for each cluster c and then dividing by the dataset cardinality. This 
is achieved by the use of a confusion matrix. Note the accuracy computed here is 
equivalent to “recall,” and the error rate is simply (1-accuracy). 
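
One way to compute this error from a confusion matrix is sketched below. The paper does not spell out how clusters are matched to the known classes, so the sketch assumes that each cluster is identified with the class that holds the majority of its members; that matching rule and the name `clustering_error` are our assumptions.

```python
from collections import Counter


def clustering_error(true_labels, cluster_labels):
    """Fraction of objects that fall outside the majority class of their cluster."""
    confusion = {}  # cluster id -> Counter of the true classes found in it
    for truth, cluster in zip(true_labels, cluster_labels):
        confusion.setdefault(cluster, Counter())[truth] += 1
    # Within each cluster, everything that is not the majority class is an error.
    misclassified = sum(sum(row.values()) - max(row.values())
                        for row in confusion.values())
    return misclassified / len(true_labels)
```

With true classes (a, a, a, b, b, b) and cluster assignments (0, 0, 1, 1, 1, 1), cluster 1 contains one stray "a", so one of six objects is misclassified.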

The error is computed at the end of each level. However, it’s worth mentioning that 
in reality, the correct clustering results would not be available in advance. The 
incorporation of such known results in our error calculation merely serves the purpose 
of demonstrating the quality of both algorithms. 

For the JPL dataset, we do not have prior knowledge of correct clustering results 
(which conforms more closely to real-life cases). Lacking this information, we cannot 
use the same evaluation to determine the error. 

Since the k-Means algorithm seeks to optimize the objective function by 
minimizing the sum of squared intra-cluster error, we evaluate the quality of 
clustering by using the objective functions. However, since the I-kMeans algorithm 
involves data with smaller dimensionality except for the last level, we have to map the 
cluster membership information to the original space, and compute the objective 
functions using the raw data in order to compare with the k-Means algorithm. We 
show that the objective functions obtained from the I-kMeans algorithm are better 
than those from the k-Means algorithm. The results are consistent with the work of 
Ding et al. [5], in which the authors show that dimensionality reduction reduces the 
chances of the algorithm being trapped in a local minimum. Furthermore, even with 
the additional step of computing the objective functions from the original data, the 
I-kMeans algorithm still takes less time to execute than the k-Means algorithm. 
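
Mapping the memberships back to the original space amounts to recomputing the centers from the raw series and summing the squared distances to them. The sketch below illustrates that step; the function name `objective` is hypothetical, and the cluster center is assumed to be the plain mean of the raw member series.

```python
def objective(raw_series, labels):
    """Sum of squared distances from each raw series to its cluster's mean series."""
    clusters = {}
    for series, label in zip(raw_series, labels):
        clusters.setdefault(label, []).append(series)
    total = 0.0
    for members in clusters.values():
        # Center recomputed in the full-dimensional (raw) space.
        center = [sum(col) / len(members) for col in zip(*members)]
        total += sum((x - c) ** 2 for s in members for x, c in zip(s, center))
    return total
```

Because only the labels are carried over from the low-resolution level, the same function can score both the k-Means and the I-kMeans memberships on an equal footing.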

In Figs 7-8, we show the errors/objective functions from the I-kMeans algorithm as 
a fraction of those obtained from the k-Means algorithm. As we can see from the 
plots, our algorithm stabilizes at early (i.e., 2nd or 3rd) stages and consistently results in 
smaller error than the classic k-Means algorithm. 





Fig. 7. Error of I-kMeans algorithm on the 
Heterogeneous dataset, presented as fraction 
of the error from the k-Means algorithm. 



Fig. 8. Objective functions of I-kMeans 
algorithm on the JPL dataset, presented as 
fraction of error from the k-Means algorithm. 



4.3 Running Time 

In this section, we present the cumulative running time for each level of the I-kMeans 
algorithm as a fraction of that of the k-Means algorithm. The cumulative running time for 
any level i is the total running time from the starting level to level i. In most cases, 
even if the I-kMeans algorithm is run to completion, the total running time is still less 
than that of the k-Means algorithm. We attribute this improvement to the good 
choices of initial centers, since they result in very few iterations until convergence. 
Moreover, we have already shown in the previous section that the I-kMeans 
algorithm finds the best result at a relatively early stage and does not need to run 
through all levels. The time required for I-kMeans is therefore less than 50% of the time 
required for k-Means for the Heterogeneous datasets. For the JPL datasets, the 
running time is less than 20% of the time for k-Means, and even if it is run to completion, 
the cumulative running time is still 50% less than that of the k-Means algorithm. 





Fig. 9. Cumulative running time for the Heterogeneous dataset. Our algorithm cuts 
the running time by more than half. 

Fig. 10. Cumulative running time for the JPL dataset. Our algorithm typically takes only 
20% of the time required for k-Means. 

While the speedup achieved is quite impressive, we note that these results are only for 
the main memory case. We should expect a much greater speedup for more realistic 
data mining problems. The reason is that when performing k-Means on a massive 
dataset, every iteration requires a database scan [2]. The I/O time for the scans dwarfs 
the relatively inexpensive CPU time. In contrast, our multi-resolution approach is able 
to run its first few levels in main memory, building a good “approximate” model (see footnote 2) 
before being forced to access the disk. We can therefore expect our approach to make 
far fewer data scans. 



4.4 I-kMeans Algorithm vs. k-Means Algorithm 

In this section, rather than showing the error/objective function on each level, we 
present only the error/objective function returned by the I-kMeans algorithm when it 
out-performs the k-Means algorithm. We also present the time taken for the I-kMeans 
algorithm to stabilize (i.e., when the result does not improve anymore). We compare 
the results to those of the k-Means algorithm. The running time for I-kMeans remains 
small regardless of data size because the algorithm out-performs k-Means at very 
early stages. 



[Fig. 11 plot: "I-kMeans Alg vs. k-Means Alg (Heterogeneous)". Legend: error_I-kMeans and error_kmeans (bars); time_I-kMeans and time_kmeans (lines).] 



Fig. 11. The I-kMeans algorithm is highly 
competitive with the k-Means algorithm. The 
errors (bars) and execution time (lines) are 
significantly smaller. 



[Fig. 12 plot: "I-kMeans Alg vs. k-Means Alg (JPL)". X axis: Data Size (1000, 2000, 4000, 8000). Legend: obj. fcn_I-kMeans and obj. fcn_k-Means (bars); time_I-kMeans and time_k-Means (lines).] 



Fig. 12. I-kMeans vs. k-Means algorithms 
in terms of objective function (bars) and 
running time (lines) for JPL dataset. 



Below we present results on a number of additional datasets from the UCR 
time-series archive. The running time recorded for I-kMeans is the time required to 
achieve the best result. Therefore, the speedup measured here is pessimistic, since 
I-kMeans typically outperforms k-Means in very early levels. We observe an average 
speedup of 3 times against the traditional k-Means and a general improvement in the 
objective function. Only in the earthquake dataset is the cumulative time greater than that of 
k-Means. This happens because the algorithm has to traverse the majority of the levels 
in order to perform the optimal clustering. However, in this case the prolonged 
execution time can be balanced by the significant improvement in the objective 
function. 



2 As we have seen in Figs 7 and 8, the “approximate” models are typically better than the 
model built on the raw data. 







Iterative Incremental Clustering of Time Series 

Table 3. Performance of I-kMeans on additional datasets. Smaller numbers indicate better 
performance 

Dataset        Obj.       Obj.       Time      Time       Speed 
               k-Means    I-kMeans   k-Means   I-kMeans   Up 
Ballbeam       6328.21    6065.61    5.83      4.30       1.36 
earthquake     110159     108887     12.9      15.19      0.85 
sunspot        4377E6     4361E6     7.45      3.07       3.36 
spot_exRates   2496.66    2497.47    6.83      2.71       2.52 
powerplant     9783E8     9584E8     10.33     3.07       3.36 
evaporator     6303E3     6281E3     21.33     6.59       3.24 
memory         1921E4     1916E4     19.48     5.91       3.29 



5 Extension to a General Framework 

We have seen that our anytime algorithm out-performs k-Means in terms of clustering 
quality and running time. We will now extend the approach and generalize it to a 
framework that can adapt to a much broader range of algorithms. More specifically, 
we substitute prominent alternatives for the two components of our approach: its frame, 
the clustering algorithm, and its essence, the decomposition method. 

We demonstrate the generality of the framework with two examples. Firstly, we use 
another widely-used iterative refinement algorithm, the EM algorithm, in place of 
the k-Means algorithm. We call this version of EM the I-EM algorithm. Next, instead 
of the Haar wavelet decomposition, we utilize an equally well-studied decomposition 
method, the Discrete Fourier Transform (DFT), in the I-kMeans algorithm. Both 
approaches have been shown to outperform their k-Means or EM counterparts. In general, 
we can use any combination of iterative refining clustering algorithm and 
multi-resolution decomposition method in our framework. 



5.1 I-EM with Expectation Maximization (EM) 

The EM algorithm with Gaussian Mixtures is very similar to the k-Means algorithm 
introduced in Table 1. As with k-Means, the algorithm begins with an initial guess of 
the cluster centers and then iteratively refines them, alternating between estimating each 
object’s membership in the clusters (the “E” or expectation step) and re-estimating the 
cluster parameters from those memberships (the “M” or maximization step). The major 
distinction is that k-Means attempts to model the 
data as a collection of k spherical regions, with every data object belonging to exactly 
one cluster. In contrast, EM models the data as a collection of k Gaussians, with every 
data object having some degree of membership in each cluster (in fact, although 
Gaussian models are most common, other distributions are possible). The major 
advantage of EM over k-Means is its ability to model a much richer set of cluster 
shapes. This generality has made EM (and its many variants and extensions) the 
clustering algorithm of choice in data mining [7] and bioinformatics [17]. 
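
The contrast between hard and soft membership can be sketched as follows. To keep the example short it assumes spherical Gaussians that share a single variance, which is a simplification of the general Gaussian mixture described above; the function names and the `variance` and `weights` parameters are hypothetical.

```python
import math


def hard_assignment(point, centers):
    """k-Means style membership: the index of the single nearest center."""
    return min(range(len(centers)),
               key=lambda c: sum((x - m) ** 2 for x, m in zip(point, centers[c])))


def soft_membership(point, centers, variance=1.0, weights=None):
    """EM style membership: posterior probability of each cluster for one point.

    Assumes spherical Gaussians with a shared variance, so the Gaussian
    normalizing constant is the same for every cluster and cancels out.
    """
    k = len(centers)
    weights = weights or [1.0 / k] * k
    log_terms = [math.log(w)
                 - sum((x - m) ** 2 for x, m in zip(point, c)) / (2 * variance)
                 for w, c in zip(weights, centers)]
    top = max(log_terms)  # subtract the maximum for numerical stability
    unnormalized = [math.exp(t - top) for t in log_terms]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]
```

A point midway between two centers receives membership 0.5 in each cluster under `soft_membership`, whereas `hard_assignment` must commit to one of them.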

5.2 Experimental Results for I-EM 

Similar to the application of k-Means, we apply EM for different resolutions of data, 
and compare the clustering quality and running time with EM on the original data. We 
use the same datasets and parameters as in k-Means. However, we have to reduce the 
dimensionality of data to 256, since otherwise the number of objects relative to the 
dimensionality would be too small for EM to perform well (if at all!). The EM algorithm 
presents the error as the negative log likelihood of data. We can compare the clustering results in a 
similar fashion as in k-Means, by projecting the results obtained at a lower dimension 
to the full dimension and computing the error on the original raw data. More 
specifically, this is achieved by re-computing the centers and the covariance matrix on 
the full dimension, given the posterior probabilities obtained at a lower dimension. 
The results are similar to those of k-Means. Fig 13 shows the errors for EM and I-EM 
algorithms on the JPL datasets. The errors for EM are shown as straight lines for easy 
visual comparison with I-EM at each level. The results show that I-EM outperforms 
EM at very early stages (4 or 8 dimensions). 
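
Re-computing a cluster's center and covariance matrix on the full dimension, given posteriors obtained at a lower dimension, is a posterior-weighted mean and covariance over the raw series. A minimal sketch for a single cluster follows; the name `weighted_gaussian` is hypothetical, and the maximum-likelihood (divide by the total weight) form of the covariance is assumed.

```python
def weighted_gaussian(raw_series, posteriors):
    """Mean vector and covariance matrix of one cluster, weighted by its posteriors.

    raw_series: full-dimensional series, one per object.
    posteriors: each object's membership in this cluster, e.g. from a coarser level.
    """
    total = sum(posteriors)
    dims = len(raw_series[0])
    mean = [sum(w * s[d] for w, s in zip(posteriors, raw_series)) / total
            for d in range(dims)]
    cov = [[sum(w * (s[a] - mean[a]) * (s[b] - mean[b])
                for w, s in zip(posteriors, raw_series)) / total
            for b in range(dims)]
           for a in range(dims)]
    return mean, cov
```

Calling this once per cluster gives the full-dimensional mixture parameters from which the negative log likelihood on the raw data can then be evaluated.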

Fig 14 shows the running time for EM and I-EM on JPL datasets. As with the error 
presentation, the running times for EM are shown as straight lines for easy visual 
comparison with I-EM. The vertical dashed line indicates where I-EM starts to 
out-perform EM (as illustrated in Fig 13, I-EM out-performs EM at every level 
following the one indicated by the dashed line). 

Fig. 13. We show the errors for different data cardinalities. The errors for EM are presented as 
constant lines for easy visual comparison with the I-EM at each level. I-EM out-performs EM 
at very early stages (4 or 8 dimensions) 

Fig. 14. Running times for different data cardinalities. The running times for EM are 
presented as constant lines for easy visual comparison with the I-EM at each level. The 
vertical dashed line indicates where I-EM starts to out-perform EM as illustrated in Fig 13 



5.3 I-kMeans with Discrete Fourier Transform 

As mentioned earlier, the choice of Haar Wavelet as the decomposition method is due 
to its efficiency and simplicity. In this section we extend the I-kMeans to utilize 
another equally well-known decomposition method, the Discrete Fourier Transform 
(DFT) [1, 20]. 

Similar to the wavelet decomposition, DFT approximates the signal with a linear 
combination of basis functions. The vital difference between the two decomposition 
methods is that the wavelets are localized in time, while DFT coefficients represent 
global contributions to the signal. Fig 15 provides a side-by-side visual comparison of 
the Haar wavelet and DFT. 

Fig. 15. Visual comparison of the Haar Wavelet and the Discrete Fourier Transform. 
Wavelet coefficients are localized in time, while DFT coefficients represent global 
contributions to the signal 
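
A DFT-based approximation at a given resolution can be sketched by keeping only the lowest frequencies and inverting the transform. This is an illustration under our own assumptions, not the authors' implementation: the function names are hypothetical, the transform is the naive O(n^2) form for brevity, and each kept frequency is retained together with its conjugate partner so that the reconstruction stays real-valued.

```python
import cmath


def dft(series):
    """Naive O(n^2) discrete Fourier transform, adequate for a short illustration."""
    n = len(series)
    return [sum(x * cmath.exp(-2j * cmath.pi * f * t / n)
                for t, x in enumerate(series))
            for f in range(n)]


def dft_approximation(series, num_freqs):
    """Rebuild the series from its `num_freqs` lowest frequencies."""
    n = len(series)
    coeffs = dft(series)
    # Frequency f and frequency n - f are conjugate partners; keep both or neither.
    kept = [c if min(f, n - f) < num_freqs else 0 for f, c in enumerate(coeffs)]
    return [sum(c * cmath.exp(2j * cmath.pi * f * t / n)
                for f, c in enumerate(kept)).real / n
            for t in range(n)]
```

With `num_freqs=1` only the zero-frequency term survives and every reconstructed value equals the series mean, which shows the global nature of each Fourier coefficient; increasing `num_freqs` adds finer detail across the whole series at once.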

Table 4. Objective functions for k-Means, I-kMeans with Haar Wavelet, and I-kMeans with 
DFT. Smaller numbers indicate tighter clusters. 

Dataset        Obj.       Haar       DFT 
               k-Means    I-kMeans   I-kMeans 
ballbeam       6328.21    6065.61    6096.54 
earthquake     110159     108887     108867 
Sunspot        4377E6     4361E6     4388E6 
spot_exRates   2496.66    2497.47    2466.28 
powerplant     9783E8     9584E8     11254E8 
evaporator     6303E3     6281E3     6290E3 
Memory         1921E4     1916E4     1803E4 



While the relative merits of the two methods have been widely debated in the past, we 
apply DFT in the algorithm to demonstrate the generality of the framework. As a 
matter of fact, consistent with the results shown in [15], the superiority of either 
method is highly data-dependent. In general, however, DFT performs better for 
smooth signals or sequences that resemble random walks. 



5.4 Experimental Results for I-kMeans with DFT 

In this section we show the quality of the results of I-kMeans, using DFT as the 
decomposition method instead of the Haar wavelet. Although there is no clear 
evidence that one decomposition method is superior to the other, in most of our 
experiments using either one of these methods with I-kMeans outperforms the batch k-Means 
algorithm. Naturally it can be argued that instead of using our iterative method, one 
might be able to achieve equal-quality results by using a batch algorithm on higher 
resolution with either decomposition. While this is true to some extent, there is always 
a higher chance of the clustering being trapped in local minima. By starting off at 
lower resolution and re-using the cluster centers each time, we mitigate the problem of 
local minima, in addition to that of choosing initial centers. 

In datasets where the time-series is approximated more faithfully by using Fourier 
than wavelet decomposition, the quality of the DFT-based incremental approach is 
slightly better. This experiment suggests that our approach can be tailored to specific 
applications, by carefully choosing the decomposition that provides the least 
reconstruction error. 

Table 4 shows the results of I-kMeans using DFT. 



6 Conclusions and Future Work 

We have presented an approach to perform incremental clustering of time-series at 
various resolutions using multi-resolution decomposition methods. We initially focus 
our approach on the k-Means clustering algorithm, and then extend the idea to EM. 
We reuse the final centers at the end of each resolution as the initial centers for the 
next level of resolution. This approach resolves the dilemma associated with the 
choices of initial centers and significantly improves the execution time and clustering 
quality. Our experimental results indicate that this approach yields faster execution 
time than the traditional k-Means (or EM) approach, in addition to improving the 
clustering quality of the algorithm. Since time 
series data can be described with coarser resolutions while still preserving a general 
shape, the anytime algorithm stabilizes at very early stages, eliminating the need to 
operate on high resolutions. In addition, the anytime algorithm allows the user to 
terminate the program at any stage. 

Our extensions of the iterative anytime algorithm to EM and of the multi-resolution 
decomposition to DFT show great promise for generalizing the approach on an even 
wider scale. More specifically, this anytime approach can be generalized to a 
framework with a much broader range of algorithms or data mining problems. For 
future work, we plan to investigate the following: 

• Extending our algorithm to other data types. For example, image histograms can 
be successfully represented as wavelets [6, 23]. Our initial experiments on image 
histograms show great promise of applying the framework on image data. 

• For k-Means, examining the possibility of re-using the results (i.e., objective 
functions that determine the quality of clustering results) from the previous 
stages to eliminate the need to re-compute all the distances. 



Abstract. Clustering is a problem of great practical importance in numerous 
applications. The problem of clustering becomes more challenging when the data is 
categorical, that is, when there is no inherent distance measure between data values. 
We introduce LIMBO, a scalable hierarchical categorical clustering algorithm that 
builds on the Information Bottleneck (IB) framework for quantifying the relevant 
information preserved when clustering. As a hierarchical algorithm, LIMBO has 
the advantage that it can produce clusterings of different sizes in a single execution. 
We use the IB framework to define a distance measure for categorical tuples and 
we also present a novel distance measure for categorical attribute values. We show 
how the LIMBO algorithm can be used to cluster both tuples and values. LIMBO 
handles large data sets by producing a memory bounded summary model for 
the data. We present an experimental evaluation of LIMBO, and we study how 
clustering quality compares to other categorical clustering algorithms. LIMBO 
supports a trade-off between efficiency (in terms of space and time) and quality. 
We quantify this trade-off and demonstrate that LIMBO allows for substantial 
improvements in efficiency with negligible decrease in quality. 



1 Introduction 

Clustering is a problem of great practical importance that has been the focus of substan- 
tial research in several domains for decades. It is defined as the problem of partitioning 
data objects into groups, such that objects in the same group are similar, while objects in 
different groups are dissimilar. This definition assumes that there is some well defined 
notion of similarity , or distance, between data objects. When the objects are defined by 
a set of numerical attributes, there are natural definitions of distance based on geometric 
analogies. These definitions rely on the semantics of the data values themselves (for ex- 
ample, the values $100K and $1 10K are more similar than $100K and $1). The definition 
of distance allows us to define a quality measure for a clustering ( e.g ., the mean square 
distance between each point and the centroid of its cluster). Clustering then becomes 
the problem of grouping together points such that the quality measure is optimized. The 
problem of clustering becomes more challenging when the data is categorical, that is, 
when there is no inherent distance measure between data values. This is often the case in 
many domains, where data is described by a set of descriptive attributes, many of which 
are neither numerical nor inherently ordered in any way. As a concrete example, consider 
a relation that stores information about movies. For the purpose of exposition, a movie 
is a tuple characterized by the attributes “director”, “actor/actress”, and “genre”. An 
instance of this relation is shown in Table 1. In this setting it is not immediately obvious 
what the distance, or similarity, is between the values “Coppola” and “Scorsese”, or the 
tuples “Vertigo” and “Harvey”. 

Without a measure of distance between data values, it is unclear how to define a 
quality measure for categorical clustering. To do this, we employ mutual information, 
a measure from information theory. A good clustering is one where the clusters are 
informative about the data objects they contain. Since data objects are expressed in terms 
of attribute values, we require that the clusters convey information about the attribute 
values of the objects in the cluster. That is, given a cluster, we wish to predict the 
attribute values associated with objects of the cluster accurately. The quality measure of 
the clustering is then the mutual information of the clusters and the attribute values. Since 
a clustering is a summary of the data, some information is generally lost. Our objective 
will be to minimize this loss, or equivalently to minimize the increase in uncertainty as 
the objects are grouped into fewer and larger clusters. 



Table 1. An instance of the movie database 

                        director    actor     genre      C     D
t_1 (Godfather II)      Scorsese    De Niro   Crime      c_1   d_1
t_2 (Good Fellas)       Coppola     De Niro   Crime      c_1   d_1
t_3 (Vertigo)           Hitchcock   Stewart   Thriller   c_2   d_1
t_4 (N by NW)           Hitchcock   Grant     Thriller   c_2   d_1
t_5 (Bishop's Wife)     Koster      Grant     Comedy     c_2   d_2
t_6 (Harvey)            Koster      Stewart   Comedy     c_2   d_2


Consider partitioning the tuples in Table 1 into two clusters. Clustering C groups the 
first two movies together into one cluster, c_1, and the remaining four into another, c_2. 
Note that cluster c_1 preserves all information about the actor and the genre of the movies 
it holds. For objects in c_1, we know with certainty that the genre is “Crime”, the actor 
is “De Niro” and there are only two possible values for the director. Cluster c_2 involves 
only two different values for each attribute. Any other clustering will result in greater 
information loss. For example, in clustering D, d_2 is equally informative as c_1, but d_1 
includes three different actors and three different directors. So, while in c_2 there are two 
equally likely values for each attribute, in d_1 the director is any of “Scorsese”, “Coppola”, 
or “Hitchcock” (with respective probabilities 0.25, 0.25, and 0.50), and similarly for the 
actor. 
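The comparison of clusterings C and D can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes equal tuple weights p(t) = 1/n and p(a|t) = 1/m (the model formalized in Section 2), measures information in bits, and uses hypothetical helper names (`value_counts`, `entropy`, `mutual_information`).

```python
from collections import Counter
from math import log2

# Table 1 as (director, actor, genre) tuples.
MOVIES = {
    "t1": ("Scorsese", "De Niro", "Crime"),
    "t2": ("Coppola", "De Niro", "Crime"),
    "t3": ("Hitchcock", "Stewart", "Thriller"),
    "t4": ("Hitchcock", "Grant", "Thriller"),
    "t5": ("Koster", "Grant", "Comedy"),
    "t6": ("Koster", "Stewart", "Comedy"),
}


def value_counts(ids):
    """Count attribute values over the given tuples; the column index keeps
    equal strings from different attributes distinct."""
    return Counter((i, v) for t in ids for i, v in enumerate(MOVIES[t]))


def entropy(counts):
    """Entropy in bits of the distribution proportional to the counts."""
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())


def mutual_information(clustering):
    """I(A;C) = H(A) - H(A|C) for a hard clustering given as lists of tuple ids."""
    n = len(MOVIES)
    h_a = entropy(value_counts(MOVIES))
    h_a_given_c = sum(len(c) / n * entropy(value_counts(c)) for c in clustering)
    return h_a - h_a_given_c


C = [["t1", "t2"], ["t3", "t4", "t5", "t6"]]
D = [["t1", "t2", "t3", "t4"], ["t5", "t6"]]
i_c, i_d = mutual_information(C), mutual_information(D)
print(f"I(A;C) = {i_c:.4f} bits, I(A;D) = {i_d:.4f} bits")
```

Under these assumptions clustering C retains about 0.918 bits about the attribute values and clustering D about 0.696 bits, which agrees with the informal argument above.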

This intuitive idea was formalized by Tishby, Pereira and Bialek [20]. They recast 
clustering as the compression of one random variable into a compact representation 
that preserves as much information as possible about another random variable. Their 
approach was named the Information Bottleneck (IB) method, and it has been applied to 
a variety of different areas. In this paper, we consider the application of the IB method 
to the problem of clustering large data sets of categorical data. 

We formulate the problem of clustering relations with categorical attributes within 
the Information Bottleneck framework, and define dissimilarity between categorical data 
objects based on the IB method. Our contributions are the following. 
• We propose LIMBO, the first scalable hierarchical algorithm for clustering categorical 
data based on the IB method. As a result of its hierarchical approach, LIMBO allows us 
in a single execution to consider clusterings of various sizes. LIMBO can also control 
the size of the model it builds to summarize the data. 

• We use LIMBO to cluster both tuples (in relational and market-basket data sets) and 
attribute values. We define a novel distance between attribute values that allows us to 
quantify the degree of interchangeability of attribute values within a single attribute. 

• We empirically evaluate the quality of clusterings produced by LIMBO relative to other 
categorical clustering algorithms including the tuple clustering algorithms IB, ROCK 
[13], and COOLCAT [4]; as well as the attribute value clustering algorithm STIRR [12]. 
We compare the clusterings based on a comprehensive set of quality metrics. 

The rest of the paper is structured as follows. In Section 2, we present the IB method, 
and we describe how to formulate the problem of clustering categorical data within the 
IB framework. In Section 3, we introduce LIMBO and show how it can be used to cluster 
tuples. In Section 4, we present a novel distance measure for categorical attribute values 
and discuss how it can be used within LIMBO to cluster attribute values. Section 5 
presents the experimental evaluation of LIMBO and other algorithms for clustering 
categorical tuples and values. Section 6 describes related work on categorical clustering 
and Section 7 discusses additional applications of the LIMBO framework. 

2 The Information Bottleneck Method 

In this section, we review some of the concepts from information theory that will be 
used in the rest of the paper. We also introduce the Information Bottleneck method, and 
we formulate the problem of clustering categorical data within this framework. 

2.1 Information Theory Basics 

The following definitions can be found in any information theory textbook, e.g., [7]. 
Let T denote a discrete random variable that takes values over the set T,¹ and let p(t) 
denote the probability mass function of T. The entropy H(T) of variable T is defined 
by 

$$H(T) = -\sum_{t \in \mathbf{T}} p(t) \log p(t).$$

Intuitively, entropy captures the “uncertainty” of variable T; the higher the entropy, 
the lower the certainty with which we can predict its value. 

Now, let T and A be two random variables that range over sets T and A respectively. 
The conditional entropy of A given T is defined as follows. 

$$H(A \mid T) = -\sum_{t \in \mathbf{T}} p(t) \sum_{a \in \mathbf{A}} p(a \mid t) \log p(a \mid t)$$

Conditional entropy captures the uncertainty of predicting the values of variable A given 
the values of variable T. The mutual information, I(T;A), quantifies the amount of 

1 For the remainder of the paper, we use italic capital letters (e.g., T) to denote random variables, 
and boldface capital letters (e.g., T) to denote the set from which the random variable takes 
values. 
information that the variables convey about each other. Mutual information is symmetric, 
and non-negative, and it is related to entropy via the equation I(T;A) = H(T) - 
H(T|A) = H(A) - H(A|T). 
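As an illustration of these definitions, the sketch below computes entropy, conditional entropy, and mutual information from a small joint distribution and evaluates both forms of the identity. The joint distribution and all function names are made up for the example; logarithms are taken base 2.

```python
from math import log2

# Hypothetical joint distribution p(t, a) over T = {t1, t2} and A = {x, y}.
JOINT = {("t1", "x"): 0.25, ("t1", "y"): 0.25, ("t2", "x"): 0.5}


def marginal(joint, axis):
    """Marginal distribution over component `axis` (0 for T, 1 for A)."""
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out


def entropy(dist):
    """H = -sum p log p, skipping zero-probability outcomes."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)


def conditional_entropy(joint, given_axis):
    """Entropy of the other variable given the variable on `given_axis`."""
    given = marginal(joint, given_axis)
    return -sum(p * log2(p / given[key[given_axis]])
                for key, p in joint.items() if p > 0)


# I(T;A) computed in the two equivalent ways.
i_via_t = entropy(marginal(JOINT, 0)) - conditional_entropy(JOINT, given_axis=1)
i_via_a = entropy(marginal(JOINT, 1)) - conditional_entropy(JOINT, given_axis=0)
print(i_via_t, i_via_a)
```

Both expressions evaluate to the same non-negative number, as the symmetry of mutual information requires.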

Relative Entropy, or Kullback-Leibler (KL) divergence, is an information-theoretic 
measure of the difference between two probability distributions. Given two distributions 
p and q over a set T, the relative entropy is defined as follows. 

$$D_{KL}[p \| q] = \sum_{t \in \mathbf{T}} p(t) \log \frac{p(t)}{q(t)}$$
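A small sketch of the relative entropy follows, assuming distributions are stored as dictionaries and logarithms are base 2. The function name and the two example distributions are illustrative only.

```python
from math import log2


def kl_divergence(p, q):
    """D_KL[p || q] in bits; infinite if q assigns zero mass where p does not."""
    total = 0.0
    for t, pt in p.items():
        if pt == 0:
            continue  # terms with p(t) = 0 contribute nothing
        qt = q.get(t, 0.0)
        if qt == 0:
            return float("inf")
        total += pt * log2(pt / qt)
    return total


p = {"a": 0.5, "b": 0.5}
q = {"a": 0.25, "b": 0.75}
print(kl_divergence(p, q), kl_divergence(q, p))
```

The two printed values differ, which shows that relative entropy is not symmetric in its arguments.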



2.2 Clustering Using the IB Method 

In categorical data clustering, the input to our problem is a set T of n tuples 
on m attributes A_1, A_2, ..., A_m. The domain of attribute A_i is the set A_i = 
{A_i.v_1, A_i.v_2, ..., A_i.v_{d_i}}, so that identical values from different attributes are treated 
as distinct values. A tuple t ∈ T takes exactly one value from the set A_i for the i-th 
attribute. Let A = A_1 ∪ ... ∪ A_m denote the set of all possible attribute values. Let 
d = d_1 + d_2 + ... + d_m denote the size of A. The data can then be conceptualized as 
an n × d matrix M, where each tuple t ∈ T is a d-dimensional row vector in M. Matrix 
entry M[t, a] is 1 if tuple t contains attribute value a, and zero otherwise. Each tuple 
contains one value for each attribute, so each tuple vector contains exactly m 1's. 

Now let T, A be random variables that range over the sets T (the set of tuples) and 
A (the set of attribute values) respectively. We normalize matrix M so that the entries of 
each row sum up to 1. For some tuple t ∈ T, the corresponding row of the normalized 
matrix holds the conditional probability distribution p(A|t). Since each tuple contains 
exactly m attribute values, for some a ∈ A, p(a|t) = 1/m if a appears in tuple t, 
and zero otherwise. Table 2 shows the normalized matrix M for the movie database 
example.² A similar formulation can be applied in the case of market-basket data, where 
each tuple contains a set of values from a single attribute [1]. 
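The construction of the normalized rows can be sketched as follows. Rows are kept sparse (only non-zero entries are stored), and each value is prefixed with its attribute name so that equal strings from different attributes stay distinct. The function name `normalized_rows` is hypothetical.

```python
ATTRS = ("director", "actor", "genre")
MOVIES = {
    "t1": ("Scorsese", "De Niro", "Crime"),
    "t2": ("Coppola", "De Niro", "Crime"),
    "t3": ("Hitchcock", "Stewart", "Thriller"),
    "t4": ("Hitchcock", "Grant", "Thriller"),
    "t5": ("Koster", "Grant", "Comedy"),
    "t6": ("Koster", "Stewart", "Comedy"),
}


def normalized_rows(tuples, attrs):
    """Return {tuple id: sparse row p(A|t)}, with p(a|t) = 1/m for present values."""
    m = len(attrs)
    return {
        tid: {f"{attr}.{value}": 1.0 / m for attr, value in zip(attrs, values)}
        for tid, values in tuples.items()
    }


rows = normalized_rows(MOVIES, ATTRS)
all_values = {a for row in rows.values() for a in row}  # the set A, of size d
print(len(all_values), rows["t3"])
```

For the movie relation this yields d = 10 distinct attribute values and rows with exactly m = 3 non-zero entries each.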



Table 2. The normalized movie table 

        d.S   d.C   d.H   d.K   a.DN  a.S   a.G   g.Cr  g.T   g.C   p(t)
t_1     1/3   0     0     0     1/3   0     0     1/3   0     0     1/6
t_2     0     1/3   0     0     1/3   0     0     1/3   0     0     1/6
t_3     0     0     1/3   0     0     1/3   0     0     1/3   0     1/6
t_4     0     0     1/3   0     0     0     1/3   0     1/3   0     1/6
t_5     0     0     0     1/3   0     0     1/3   0     0     1/3   1/6
t_6     0     0     0     1/3   0     1/3   0     0     0     1/3   1/6



A k-clustering C_k of the tuples in T partitions them into k clusters C_k = 
{c_1, c_2, c_3, ..., c_k}, where each cluster c_i ∈ C_k is a non-empty subset of T such that 
c_i ∩ c_j = ∅ for all i, j, i ≠ j, and ∪_{i=1}^{k} c_i = T. Let C_k denote a random variable that 

2 We use abbreviations for the attribute values. For example d.H stands for director.Hitchcock. 
ranges over the clusters in C_k. We define k to be the size of the clustering. When k 
is fixed or when it is immaterial to the discussion, we will use C (boldface) and C (italic) 
to denote the clustering and the corresponding random variable, respectively. 

Now, let C be a specific clustering. Giving equal weight to each tuple t ∈ T, we 
define p(t) = 1/n. Then, for c ∈ C, the elements of T, A, and C are related as follows. 

$$p(c) = \sum_{t \in c} p(t) = \frac{|c|}{n} \qquad \text{and} \qquad p(a \mid c) = \frac{1}{p(c)} \sum_{t \in c} p(t)\, p(a \mid t)$$

We seek clusterings of the elements of T such that, for t ∈ c, knowledge of the cluster 
identity, c, provides essentially the same prediction of, or information about, the values 
in A as does the specific knowledge of t. The mutual information I(A;C) measures the 
information about the values in A provided by the identity of a cluster in C. The higher 
I(A;C), the more informative the cluster identity is about the values in A contained 
in the cluster. Tishby, Pereira and Bialek [20] define clustering as an optimization 
problem, where, for a given number k of clusters, we wish to identify the k-clustering 
that maximizes I(A;C_k). Intuitively, in this procedure, the information contained in T 
about A is “squeezed” through a compact “bottleneck” clustering C_k, which is forced 
to represent the “relevant” part in T with respect to A. Tishby et al. [20] prove that, for a 
fixed number k of clusters, the optimal clustering C_k partitions the objects in T so that 
the average relative entropy 

$$\sum_{c \in \mathbf{C}_k,\, t \in \mathbf{T}} p(t, c)\, D_{KL}\left[ p(a \mid t) \,\|\, p(a \mid c) \right]$$

is minimized. 
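For a hard clustering, where each tuple belongs to exactly one cluster, the average relative entropy equals the information lost by replacing tuples with clusters, I(A;T) - I(A;C). The sketch below checks this numerically on the movie example for the clustering {t1, t2}, {t3, ..., t6}. It is an illustration under the assumptions p(t) = 1/n and p(a|t) = 1/m, with hypothetical helper names, and it uses the standard identity I(A;X) = Σ_x p(x) D_KL[p(A|x) || p(A)].

```python
from math import log2

MOVIES = {
    "t1": ("Scorsese", "De Niro", "Crime"),
    "t2": ("Coppola", "De Niro", "Crime"),
    "t3": ("Hitchcock", "Stewart", "Thriller"),
    "t4": ("Hitchcock", "Grant", "Thriller"),
    "t5": ("Koster", "Grant", "Comedy"),
    "t6": ("Koster", "Stewart", "Comedy"),
}


def row(values):
    """Sparse p(A|t); the column index keeps attributes apart."""
    return {(i, v): 1.0 / len(values) for i, v in enumerate(values)}


def mixture(dists, weights):
    """Weighted average of sparse distributions."""
    out = {}
    for dist, w in zip(dists, weights):
        for a, x in dist.items():
            out[a] = out.get(a, 0.0) + w * x
    return out


def kl(p, q):
    return sum(x * log2(x / q[a]) for a, x in p.items() if x > 0)


def mutual_information(parts, p_a):
    """I(A;X) for parts = [(p(x), p(A|x)), ...] and marginal p(A)."""
    return sum(w * kl(dist, p_a) for w, dist in parts)


n = len(MOVIES)
rows = {t: row(v) for t, v in MOVIES.items()}
p_a = mixture(list(rows.values()), [1.0 / n] * n)

C = [["t1", "t2"], ["t3", "t4", "t5", "t6"]]
clusters = [(len(c) / n, mixture([rows[t] for t in c], [1.0 / len(c)] * len(c)))
            for c in C]

i_at = mutual_information([(1.0 / n, r) for r in rows.values()], p_a)
i_ac = mutual_information(clusters, p_a)
avg_kl = sum(kl(rows[t], dist) / n for c, (_, dist) in zip(C, clusters) for t in c)
print(i_at - i_ac, avg_kl)
```

Both printed quantities agree, so minimizing the average relative entropy and maximizing I(A;C) select the same clustering in this setting.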

Finding the optimal clustering is an NP-complete problem [11]. Slonim and 
Tishby [18] propose a greedy agglomerative approach, the Agglomerative Information 
Bottleneck (AIB) algorithm, for finding an informative clustering. The algorithm starts 
with the clustering C_n, in which each object t ∈ T is assigned to its own cluster. Due 
to the one-to-one mapping between C_n and T, I(A;C_n) = I(A;T). The algorithm 
then proceeds iteratively, for n - k steps, reducing the number of clusters in the current 
clustering by one in each iteration. At step n - ℓ + 1 of the AIB algorithm, two 
clusters c_i, c_j in the ℓ-clustering C_ℓ are merged into a single component c* to produce 
a new (ℓ - 1)-clustering C_{ℓ-1}. As the algorithm forms clusterings of smaller size, 
the information that the clustering contains about the values in A decreases; that is, 
I(A;C_{ℓ-1}) ≤ I(A;C_ℓ). The clusters c_i and c_j to be merged are chosen to minimize 
the information loss in moving from clustering C_ℓ to clustering C_{ℓ-1}. This information 
loss is given by δI(c_i, c_j) = I(A;C_ℓ) - I(A;C_{ℓ-1}). We can also view the information 
loss as the increase in the uncertainty. Recall that I(A;C) = H(A) - H(A|C). Since 
H(A) is independent of the clustering C, maximizing the mutual information I(A;C) 
is the same as minimizing the entropy of the clustering H(A|C). 
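The greedy procedure can be sketched in a few lines. This is a simplified illustration rather than the implementation of Slonim and Tishby: it recomputes all pairwise losses in every iteration, and it evaluates the loss of a merge directly as the increase in H(A|C), that is p(c*)H(A|c*) - p(c_i)H(A|c_i) - p(c_j)H(A|c_j). Function names and the cluster representation (a pair of p(c) and a sparse p(A|c)) are assumptions made for the example.

```python
from itertools import combinations
from math import log2

MOVIES = {
    "t1": ("Scorsese", "De Niro", "Crime"),
    "t2": ("Coppola", "De Niro", "Crime"),
    "t3": ("Hitchcock", "Stewart", "Thriller"),
    "t4": ("Hitchcock", "Grant", "Thriller"),
    "t5": ("Koster", "Grant", "Comedy"),
    "t6": ("Koster", "Stewart", "Comedy"),
}


def weighted_entropy(cluster):
    """p(c) * H(A|c) for a cluster given as (p(c), sparse p(A|c))."""
    p_c, dist = cluster
    return -p_c * sum(x * log2(x) for x in dist.values() if x > 0)


def merge(c1, c2):
    """Merged cluster: probabilities add, distributions are averaged by weight."""
    (p1, d1), (p2, d2) = c1, c2
    p = p1 + p2
    dist = {a: (p1 * d1.get(a, 0.0) + p2 * d2.get(a, 0.0)) / p
            for a in set(d1) | set(d2)}
    return p, dist


def information_loss(c1, c2):
    """Increase in H(A|C) caused by merging c1 and c2."""
    return weighted_entropy(merge(c1, c2)) - weighted_entropy(c1) - weighted_entropy(c2)


def aib(tuples, k):
    """Greedy agglomerative clustering down to k clusters."""
    n = len(tuples)
    clusters = {
        frozenset([t]): (1.0 / n, {(i, v): 1.0 / len(vals) for i, v in enumerate(vals)})
        for t, vals in tuples.items()
    }
    while len(clusters) > k:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: information_loss(clusters[pair[0]], clusters[pair[1]]))
        clusters[a | b] = merge(clusters.pop(a), clusters.pop(b))
    return clusters


result = aib(MOVIES, k=2)
print(sorted(sorted(members) for members in result))
```

On the movie relation the sketch ends with the clusters {t1, t2} and {t3, t4, t5, t6}, the clustering C of the introduction.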

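For the merged cluster c* = c_i ∪ c_j, we have the following. 

$$p(c^*) = p(c_i) + p(c_j) \qquad (1)$$

$$p(A \mid c^*) = \frac{p(c_i)}{p(c^*)}\, p(A \mid c_i) + \frac{p(c_j)}{p(c^*)}\, p(A \mid c_j) \qquad (2)$$

Tishby et al. [20] show that 

$$\delta I(c_i, c_j) = \left[ p(c_i) + p(c_j) \right] D_{JS}\left[ p(A \mid c_i),\, p(A \mid c_j) \right] \qquad (3)$$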







P. Andritsos et al. 



where D_JS is the Jensen-Shannon (JS) divergence, defined as follows. Let p_i = p(A|c_i) 
and p_j = p(A|c_j), and let 

$$\bar{p} = \frac{p(c_i)}{p(c^*)}\, p_i + \frac{p(c_j)}{p(c^*)}\, p_j .$$

Then, the D_JS distance is defined as follows. 

$$D_{JS}[p_i, p_j] = \frac{p(c_i)}{p(c^*)}\, D_{KL}[p_i \| \bar{p}] + \frac{p(c_j)}{p(c^*)}\, D_{KL}[p_j \| \bar{p}]$$

The D_JS distance defines a metric and it is bounded above by one. We note that the 
information loss for merging clusters c_i and c_j depends only on the clusters c_i and c_j, 
and not on other parts of the clustering C_ℓ. 
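The merge cost of Equation 3 can be sketched as below, assuming base-2 logarithms and clusters represented as a pair of p(c) and a sparse dictionary for p(A|c). The names `js_divergence` and `delta_i` are illustrative.

```python
from math import log2


def kl(p, q):
    """D_KL[p || q]; q is assumed to cover the support of p."""
    return sum(x * log2(x / q[a]) for a, x in p.items() if x > 0)


def js_divergence(p_i, p_j, w_i, w_j):
    """Weighted Jensen-Shannon divergence with mixing weights w_i + w_j = 1."""
    mix = {a: w_i * p_i.get(a, 0.0) + w_j * p_j.get(a, 0.0)
           for a in set(p_i) | set(p_j)}
    return w_i * kl(p_i, mix) + w_j * kl(p_j, mix)


def delta_i(c_i, c_j):
    """Information loss of merging two clusters, following Equation 3."""
    (p_i, d_i), (p_j, d_j) = c_i, c_j
    p_star = p_i + p_j
    return p_star * js_divergence(d_i, d_j, p_i / p_star, p_j / p_star)


# Singleton clusters for tuples t1 and t2 of the movie example (n = 6, m = 3).
t1 = (1 / 6, {"d.S": 1 / 3, "a.DN": 1 / 3, "g.Cr": 1 / 3})
t2 = (1 / 6, {"d.C": 1 / 3, "a.DN": 1 / 3, "g.Cr": 1 / 3})
t3 = (1 / 6, {"d.H": 1 / 3, "a.S": 1 / 3, "g.T": 1 / 3})
print(delta_i(t1, t2), delta_i(t1, t3))
```

Tuples t1 and t2 share two of their three values and merge at a cost of 1/9 bit, while t1 and t3 share none and cost 1/3 bit. With equal weights and disjoint supports the divergence reaches its upper bound of one.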

This approach considers all attribute values as a single variable, without taking 
into account the fact that the values come from different attributes. Alternatively, we 
could define a random variable for every attribute A_i. We can show that in applying the 
Information Bottleneck method to relational data, considering all attributes as a single 
random variable is equivalent to considering each attribute independently [1]. 

In the model of the data described so far, every tuple contains one value for each 
attribute. However, this is not the case when we consider market-basket data, which 
describes a database of transactions for a store, where every tuple consists of the items 
purchased by a single customer. It is also used as a term that collectively describes a data 
set where the tuples are sets of values of a single attribute, and each tuple may contain a 
different number of values. In the case of market-basket data, a tuple t_i contains d_i values. 
Setting p(t_i) = 1/n and p(a|t_i) = 1/d_i, if a appears in t_i, we can define the mutual 
information I(T;A) and proceed with the Information Bottleneck method to cluster the 
tuples. 
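The market-basket setting changes only how rows are normalized. A minimal sketch, assuming non-empty baskets given as sets of item names; the basket contents and the function name are invented for illustration.

```python
def basket_rows(baskets):
    """Return (p(t), p(A|t)) with p(t_i) = 1/n and p(a|t_i) = 1/d_i."""
    n = len(baskets)
    prior = {tid: 1.0 / n for tid in baskets}
    rows = {tid: {item: 1.0 / len(items) for item in items}
            for tid, items in baskets.items()}
    return prior, rows


# Hypothetical transactions of different sizes.
BASKETS = {
    "b1": {"milk", "bread"},
    "b2": {"milk"},
    "b3": {"beer", "chips", "salsa"},
}
prior, rows = basket_rows(BASKETS)
print(prior["b1"], rows["b1"], rows["b3"])
```

Each row still sums to one, so the resulting rows can be used wherever p(A|t) is required.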

3 LIMBO Clustering 

The Agglomerative Information Bottleneck algorithm suffers from high computational 
complexity, namely O(n^2 d^2 log n), which is prohibitive for large data sets. We now 
introduce the scaLable InforMation BOttleneck, LIMBO, algorithm that uses distributional 
summaries in order to deal with large data sets. LIMBO is based on the idea that 
we do not need to keep whole tuples, or whole clusters in main memory, but instead, 
just sufficient statistics to describe them. LIMBO produces a compact summary model 
of the data, and then performs clustering on the summarized data. In our algorithm, we 
bound the sufficient statistics, that is the size of our summary model. This, together with 
an IB inspired notion of distance and a novel definition of summaries to produce the 
solution, makes our approach different from the one employed in the BIRCH clustering 
algorithm for clustering numerical data [21]. In BIRCH a heuristic threshold is used to 
control the accuracy of the summary created. In the experimental section of this paper, 
we study the effect of such a threshold in LIMBO. 

3.1 Distributional Cluster Features 

We summarize a cluster of tuples in a Distributional Cluster Feature (DCF). We will 
use the information in the relevant DCFs to compute the distance between two clusters 
or between a cluster and a tuple. 
Let T denote a set of tuples over a set A of attributes, and let T and A be the 
corresponding random variables, as described earlier. Also let C denote a clustering of 
the tuples in T and let C be the corresponding random variable. For some cluster c ∈ C, 
the Distributional Cluster Feature (DCF) of cluster c is defined by the pair 

DCF(c) = (p(c), p(A|c)) 

where p(c) is the probability of cluster c, and p(A|c) is the conditional probability 
distribution of the attribute values given the cluster c. We will often use DCF(c) and c 
interchangeably. 

If c consists of a single tuple t ∈ T, p(t) = 1/n, and p(A|t) is computed as described 
in Section 2. For example, in the movie database, for tuple t_i, DCF(t_i) corresponds to 
the i-th row of the normalized matrix M in Table 2. For larger clusters, the DCF is 
computed recursively as follows: let c* denote the cluster we obtain by merging two 
clusters c_1 and c_2. The DCF of the cluster c* is 

DCF(c*) = (p(c*), p(A|c*)) 

where p(c*) and p(A|c*) are computed using Equations 1 and 2 respectively. We define 
the distance, d(c_1, c_2), between DCF(c_1) and DCF(c_2) as the information loss 
δI(c_1, c_2) incurred for merging the corresponding clusters c_1 and c_2. The distance 
d(c_1, c_2) is computed using Equation 3. The information loss depends only on the clusters 
c_1 and c_2, and not on the clustering C in which they belong. Therefore, d(c_1, c_2) is 
a well-defined distance measure. 

The DCFs can be stored and updated incrementally. The probability vectors are 
stored as sparse vectors, reducing the amount of space considerably. Each DCF provides 
a summary of the corresponding cluster which is sufficient for computing the distance 
between two clusters. 
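One possible in-memory representation of a DCF is sketched below. The class layout, method names, and the use of a dictionary as the sparse vector are assumptions for illustration; the paper does not prescribe a concrete data structure. The distance uses Equation 3 written as p(c_1) D_KL[p_1 || p̄] + p(c_2) D_KL[p_2 || p̄], which is the same quantity after multiplying out p(c*).

```python
from dataclasses import dataclass
from math import log2


@dataclass
class DCF:
    p: float     # p(c), the probability of the cluster
    dist: dict   # sparse p(A|c): only non-zero entries are stored

    @classmethod
    def from_tuple(cls, values, n):
        """DCF of a single tuple: p(t) = 1/n and p(a|t) = 1/m."""
        m = len(values)
        return cls(1.0 / n, {(i, v): 1.0 / m for i, v in enumerate(values)})

    def merge(self, other):
        """DCF of the union of two clusters (Equations 1 and 2)."""
        p = self.p + other.p
        keys = set(self.dist) | set(other.dist)
        dist = {a: (self.p * self.dist.get(a, 0.0)
                    + other.p * other.dist.get(a, 0.0)) / p for a in keys}
        return DCF(p, dist)

    def distance(self, other):
        """Information loss of merging the two clusters (Equation 3)."""
        merged = self.merge(other)

        def weighted_kl(part):
            return part.p * sum(x * log2(x / merged.dist[a])
                                for a, x in part.dist.items() if x > 0)

        return weighted_kl(self) + weighted_kl(other)


t1 = DCF.from_tuple(("Scorsese", "De Niro", "Crime"), n=6)
t2 = DCF.from_tuple(("Coppola", "De Niro", "Crime"), n=6)
t3 = DCF.from_tuple(("Hitchcock", "Stewart", "Thriller"), n=6)
c12 = t1.merge(t2)
print(c12.p, t1.distance(t2), c12.distance(t3))
```

Because a merged DCF is built only from the two DCFs being merged, summaries can be updated incrementally and the order of merges does not change the resulting distribution.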

3.2 The DCF Tree 

The DCF tree is a height-balanced tree as depicted in Figure 1. Each node in the tree 
contains at most B entries, where B is the branching factor of the tree. All node entries 
store DCFs. At any point in the construction of the tree, the DCFs at the leaves define a 
clustering of the tuples seen so far. Each non-leaf node stores DCFs that are produced by 
merging the DCFs of its children. The DCF tree is built in a B-tree-like dynamic fashion. 
The insertion algorithm is described in detail below. After all tuples are inserted in the 
tree, the DCF tree embodies a compact representation where the data is summarized by 
the DCFs of the leaves. 

3.3 The LIMBO Clustering Algorithm 

The LIMBO algorithm proceeds in three phases. In the first phase, the DCF tree is 
constructed to summarize the data. In the second phase, the DCFs of the tree leaves are 
merged to produce a chosen number of clusters. In the third phase, we associate each 
tuple with the DCF to which the tuple is closest. 
Fig. 1. A DCF tree with branching factor 6. 



Phase 1: Insertion into the DCF tree. Tuples are read and inserted one by one. Tuple 
t is converted into DCF(t), as described in Section 3.1. Then, starting from the root, 
we trace a path downward in the DCF tree. When at a non-leaf node, we compute the 
distance between DCF(t) and each DCF entry of the node, finding the closest DCF 
entry to DCF(t). We follow the child pointer of this entry to the next level of the tree. 
When at a leaf node, let DCF(c) denote the DCF entry in the leaf node that is closest 
to DCF(t). DCF(c) is the summary of some cluster c. At this point, we need to decide 
whether t will be absorbed in the cluster c or not. 

In our space-bounded algorithm, an input parameter S indicates the maximum space 
bound. Let E be the maximum size of a DCF entry (note that sparse DCFs may be 
smaller than E). We compute the maximum number of nodes (N = S/(EB)) and keep 
a counter of the number of used nodes as we build the tree. If there is an empty entry 
in the leaf node that contains DCF(c), then DCF(t) is placed in that entry. If there is 
no empty leaf entry and there is sufficient free space, then the leaf node is split into two 
leaves. We find the two DCFs in the leaf node that are farthest apart and we use them as 
seeds for the new leaves. The remaining DCFs, and DCF(t), are placed in the leaf that 
contains the seed DCF to which they are closest. Finally, if the space bound has been 
reached, then we compare d(c, t) with the minimum distance of any two DCF entries 
in the leaf. If d(c, t) is smaller than this minimum, we merge DCF(t) with DCF(c); 
otherwise the two closest entries are merged and DCF(t) occupies the freed entry. 

When a leaf node is split, resulting in the creation of a new leaf node, the leaf’s 
parent is updated, and a new entry is created at the parent node that describes the newly 
created leaf. If there is space in the non-leaf node, we add a new DCF entry, otherwise 
the non-leaf node must also be split. This process continues upward in the tree until the 
root is either updated or split itself. In the latter case, the height of the tree increases by 
one. 
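The three cases that arise at a leaf (free entry, split, merge under the space bound) can be sketched as a single function. This is a simplified illustration that handles one leaf only and leaves out the descent from the root and the upward propagation of splits. DCFs are represented as pairs of p(c) and a sparse p(A|c); the function names, the `can_split` flag standing in for the node counter, and the assumption that the leaf capacity is at least two are all choices made for the example.

```python
from itertools import combinations
from math import log2


def dcf(values, n):
    return 1.0 / n, {(i, v): 1.0 / len(values) for i, v in enumerate(values)}


def merge(c1, c2):
    (p1, d1), (p2, d2) = c1, c2
    p = p1 + p2
    return p, {a: (p1 * d1.get(a, 0.0) + p2 * d2.get(a, 0.0)) / p
               for a in set(d1) | set(d2)}


def distance(c1, c2):
    """Information loss of merging c1 and c2."""
    _, mixed = merge(c1, c2)
    return sum(p * sum(x * log2(x / mixed[a]) for a, x in d.items() if x > 0)
               for p, d in (c1, c2))


def insert_into_leaf(leaf, new, capacity, can_split):
    """Return the list of leaves that replaces `leaf` after inserting `new`."""
    if len(leaf) < capacity:                       # an empty entry is available
        return [leaf + [new]]
    pairs = list(combinations(range(len(leaf)), 2))
    if can_split:                                  # space bound not yet reached
        s1, s2 = max(pairs, key=lambda ij: distance(leaf[ij[0]], leaf[ij[1]]))
        leaves = [[leaf[s1]], [leaf[s2]]]          # farthest pair seeds the leaves
        rest = [e for i, e in enumerate(leaf) if i not in (s1, s2)] + [new]
        for e in rest:
            closer_to_first = distance(e, leaf[s1]) <= distance(e, leaf[s2])
            leaves[0 if closer_to_first else 1].append(e)
        return leaves
    # Space bound reached: the number of entries in the leaf must not grow.
    closest = min(range(len(leaf)), key=lambda i: distance(leaf[i], new))
    i, j = min(pairs, key=lambda ij: distance(leaf[ij[0]], leaf[ij[1]]))
    if distance(leaf[closest], new) < distance(leaf[i], leaf[j]):
        out = list(leaf)
        out[closest] = merge(leaf[closest], new)   # absorb the new tuple
        return [out]
    kept = [e for idx, e in enumerate(leaf) if idx not in (i, j)]
    return [kept + [merge(leaf[i], leaf[j]), new]]  # new tuple takes the freed entry


t1 = dcf(("Scorsese", "De Niro", "Crime"), 6)
t2 = dcf(("Coppola", "De Niro", "Crime"), 6)
t3 = dcf(("Hitchcock", "Stewart", "Thriller"), 6)
absorbed = insert_into_leaf([t1, t3], t2, capacity=2, can_split=False)
split = insert_into_leaf([t1, t3], t2, capacity=2, can_split=True)
freed = insert_into_leaf([t1, t2], t3, capacity=2, can_split=False)
print(len(absorbed[0]), [len(leaf) for leaf in split], len(freed[0]))
```

In the first call t2 is absorbed by the entry for t1, in the second the leaf splits with t2 joining t1, and in the third the entries for t1 and t2 are merged so that t3 can occupy the freed entry.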

Phase 2: Clustering. After the construction of the DCF tree, the leaf nodes hold the 
DCFs of a clustering C of the tuples in T. Each DCF(c) corresponds to a cluster c ∈ C, 
and contains sufficient statistics for computing p(A|c), and probability p(c). We employ 
the Agglomerative Information Bottleneck (AIB) algorithm to cluster the DCFs in the 
leaves and produce clusterings of the DCFs. We note that any clustering algorithm is 
applicable at this phase of the algorithm. 
Phase 3: Associating tuples with clusters. For a chosen value of k, Phase 2 produces k 
DCFs that serve as representatives of k clusters. In the final phase, we perform a scan 
over the data set and assign each tuple to the cluster whose representative is closest to 
the tuple. 
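The final scan can be sketched as a nearest-representative assignment. The sketch assumes that each tuple is compared as a singleton DCF with p(t) = 1/n and that the distance is the information loss of Equation 3; the representatives are built here by hand from the movie example, and all names are hypothetical.

```python
from math import log2

MOVIES = {
    "t1": ("Scorsese", "De Niro", "Crime"),
    "t2": ("Coppola", "De Niro", "Crime"),
    "t3": ("Hitchcock", "Stewart", "Thriller"),
    "t4": ("Hitchcock", "Grant", "Thriller"),
    "t5": ("Koster", "Grant", "Comedy"),
    "t6": ("Koster", "Stewart", "Comedy"),
}


def dcf(values, n):
    return 1.0 / n, {(i, v): 1.0 / len(values) for i, v in enumerate(values)}


def merge(c1, c2):
    (p1, d1), (p2, d2) = c1, c2
    p = p1 + p2
    return p, {a: (p1 * d1.get(a, 0.0) + p2 * d2.get(a, 0.0)) / p
               for a in set(d1) | set(d2)}


def distance(c1, c2):
    _, mixed = merge(c1, c2)
    return sum(p * sum(x * log2(x / mixed[a]) for a, x in d.items() if x > 0)
               for p, d in (c1, c2))


def assign(tuples, representatives):
    """Map each tuple id to the index of its closest representative DCF."""
    n = len(tuples)
    return {
        tid: min(range(len(representatives)),
                 key=lambda k: distance(dcf(values, n), representatives[k]))
        for tid, values in tuples.items()
    }


n = len(MOVIES)
singles = {t: dcf(v, n) for t, v in MOVIES.items()}
rep_0 = merge(singles["t1"], singles["t2"])
rep_1 = merge(merge(singles["t3"], singles["t4"]),
              merge(singles["t5"], singles["t6"]))
labels = assign(MOVIES, [rep_0, rep_1])
print(labels)
```

Each tuple is read once and compared against k representatives, which matches the single additional pass over the data described above.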



3.4 Analysis of LIMBO 

We now present an analysis of the I/O and CPU costs for each phase of the LIMBO 
algorithm. In what follows, n is the number of tuples in the data set, d is the total number 
of attribute values, B is the branching factor of the DCF tree, and k is the chosen number 
of clusters. 

Phase 1: The I/O cost of this stage is a scan that involves reading the data set from 
the disk. For the CPU cost, when a new tuple is inserted the algorithm considers a path 
of nodes in the tree, and for each node in the path, it performs at most B operations 
(distance computations, or updates), each taking time O(d). Thus, if h is the height of 
the DCF tree produced in Phase 1, locating the correct leaf node for a tuple takes time 
O(hdB). The time for a split is O(dB^2). If U is the number of non-leaf nodes, then 
all splits are performed in time O(dUB^2) in total. Hence, the CPU cost of creating the 
DCF tree is O(nhdB + dUB^2). We observed experimentally that LIMBO produces 
compact trees of small height (both h and U are bounded). 

Phase 2: For values of S that produce clusterings of high quality the DCF tree is compact 
enough to fit in main memory. Hence, there is no I/O cost involved in this phase, since it 
involves only the clustering of the leaf node entries of the DCF tree. If L is the number of 
DCF entries at the leaves of the tree, then the AIB algorithm takes time O(L^2 d^2 log L). 
In our experiments, L ≪ n, so the CPU cost is low. 

Phase 3: The I/O cost of this phase is the reading of the data set from the disk again. 
The CPU complexity is O(kdn), since each tuple is compared against the k DCFs that 
represent the clusters. 



4 Intra-attribute Value Distance 

In this section, we propose a novel application that can be used within LIMBO to 
quantify the distance between attribute values of the same attribute. Categorical data is 
characterized by the fact that there is no inherent distance between attribute values. For 
example, in the movie database instance, given the values “Scorsese” and “Coppola”, it 
is not apparent how to assess their similarity. Comparing the set of tuples in which they 
appear is not useful since every movie has a single director. In order to compare attribute 
values, we need to place them within a context. Then, two attribute values are similar if 
the contexts in which they appear are similar. We define the context as the distribution 
these attribute values induce on the remaining attributes. For example, for the attribute 
“director”, two directors are considered similar if they induce a “similar” distribution 
over the attributes “actor” and “genre”. 
Table 3. The “director” attribute 

director     a.DN  a.S   a.G   g.Cr  g.T   g.C   p(d)
Scorsese     1/2   0     0     0     1/2   0     1/6
Coppola      1/2   0     0     0     1/2   0     1/6
Hitchcock    0     1/3   1/3   0     2/3   0     2/6
Koster       0     1/3   1/3   0     0     2/3   2/6



Formally, let A' be the attribute of interest, and let A' denote the set of values of 
attribute A'. Also let Ã = A \ A' denote the set of attribute values for the remaining 
attributes. For the example of the movie database, if A' is the director attribute, with 
A' = {d.S, d.C, d.H, d.K}, then Ã = {a.DN, a.S, a.G, g.Cr, g.T, g.C}. Let A' and 
Ã be random variables that range over A' and Ã respectively, and let p(Ã|v) denote the 
distribution that value v ∈ A' induces on the values in Ã. For some a ∈ Ã, p(a|v) is 
the fraction of the tuples in T that contain v, and also contain value a. Also, for some 
v ∈ A', p(v) is the fraction of tuples in T that contain the value v. Table 3 shows an 
example of a table when A' is the director attribute. 

For two values v_1, v_2 ∈ A', we define the distance between v_1 and v_2 to be the 
information loss δI(v_1, v_2) incurred about the variable Ã if we merge values v_1 and 
v_2. This is equal to the increase in the uncertainty of predicting the values of variable 
Ã when we replace values v_1 and v_2 with v_1 ∨ v_2. In the movie example, Scorsese and 
Coppola are the most similar directors.³ 
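The distance between attribute values can be sketched directly from the tuples of Table 1. The sketch makes one assumption that should be stated clearly: the context of a value is normalized so that it sums to one (each co-occurring value of a tuple contributes equally), so individual entries need not coincide with those printed in Table 3. Function names are hypothetical.

```python
from itertools import combinations
from math import log2

MOVIES = [
    ("Scorsese", "De Niro", "Crime"),
    ("Coppola", "De Niro", "Crime"),
    ("Hitchcock", "Stewart", "Thriller"),
    ("Hitchcock", "Grant", "Thriller"),
    ("Koster", "Grant", "Comedy"),
    ("Koster", "Stewart", "Comedy"),
]


def contexts(tuples, attr_index):
    """Map each value v of the chosen attribute to (p(v), context distribution)."""
    groups = {}
    for values in tuples:
        groups.setdefault(values[attr_index], []).append(values)
    out = {}
    for v, members in groups.items():
        dist = {}
        for values in members:
            others = [(i, x) for i, x in enumerate(values) if i != attr_index]
            for key in others:
                dist[key] = dist.get(key, 0.0) + 1.0 / (len(others) * len(members))
        out[v] = (len(members) / len(tuples), dist)
    return out


def distance(c1, c2):
    """Information loss about the remaining attributes when two values are merged."""
    (p1, d1), (p2, d2) = c1, c2
    p = p1 + p2
    mixed = {a: (p1 * d1.get(a, 0.0) + p2 * d2.get(a, 0.0)) / p
             for a in set(d1) | set(d2)}
    return sum(w * sum(x * log2(x / mixed[a]) for a, x in d.items() if x > 0)
               for w, d in (c1, c2))


directors = contexts(MOVIES, attr_index=0)
pairs = {frozenset(pair): distance(directors[pair[0]], directors[pair[1]])
         for pair in combinations(directors, 2)}
closest = min(pairs, key=pairs.get)
print(sorted(closest), pairs[closest])
```

Scorsese and Coppola induce identical contexts in this data, so merging them loses no information and their distance is zero.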



The definition of a distance measure for categorical attribute values is a contribution 
in itself, since it imposes some structure on an inherently unstructured problem. We can 
define a distance measure between tuples as the sum of the distances of the individual 
attributes. Another possible application is to cluster intra-attribute values. For example, 
in a movie database, we may be interested in discovering clusters of directors or actors, 
which in turn could help in improving the classification of movie tuples. Given the 
joint distribution of random variables A' and A we can apply the LIMBO algorithm for 
clustering the values of attribute A'. Merging two values v_1, v_2 ∈ A' produces a new value 
v_1 ∨ v_2, where p(v_1 ∨ v_2) = p(v_1) + p(v_2), since v_1 and v_2 never appear together. Also, 

$$p(a \mid v_1 \vee v_2) = \frac{p(v_1)}{p(v_1 \vee v_2)}\, p(a \mid v_1) + \frac{p(v_2)}{p(v_1 \vee v_2)}\, p(a \mid v_2).$$

The problem of defining a context sensitive distance measure between attribute values 
is also considered by Das and Mannila [9]. They define an iterative algorithm for 
computing the interchangeability of two values. We believe that our approach gives a 
natural quantification of the concept of interchangeability. Furthermore, our approach 
has the advantage that it allows for the definition of distance between clusters of values, 
which can be used to perform intra-attribute value clustering. Gibson et al. [12] 
proposed STIRR, an algorithm that clusters attribute values. STIRR does not define a 
distance measure between attribute values and, furthermore, produces just two clusters 
of values. 



3 A conclusion that agrees with a well-informed cinematic opinion. 
5 Experimental Evaluation 

In this section, we perform a comparative evaluation of the LIMBO algorithm on both real 
and synthetic data sets, with other categorical clustering algorithms, including what we 
believe to be the only other scalable information-theoretic clustering algorithm, 
COOLCAT [3,4]. 

5.1 Algorithms 

We compare the clustering quality of LIMBO with the following algorithms. 

ROCK Algorithm. ROCK [13] assumes a similarity measure between tuples, and defines 
a link between two tuples whose similarity exceeds a threshold θ. The aggregate 
interconnectivity between two clusters is defined as the sum of links between their tuples. 
ROCK is an agglomerative algorithm, so it is not applicable to large data sets. We 
use the Jaccard Coefficient for the similarity measure as suggested in the original paper. 
For data sets that appear in the original ROCK paper, we set the threshold θ to the value 
suggested there, otherwise we set θ to the value that gave us the best results in terms of 
quality. In our experiments, we use the implementation of Guha et al. [13]. 
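The similarity measure and the thresholding step can be illustrated as follows. The sketch covers only the one-sentence description given here: the Jaccard coefficient between two tuples, viewed as sets of attribute values, and the test against θ. It does not reproduce the ROCK algorithm of [13], and the function names and the value θ = 0.3 are chosen purely for the example.

```python
from itertools import combinations

MOVIES = {
    "t1": ("Scorsese", "De Niro", "Crime"),
    "t2": ("Coppola", "De Niro", "Crime"),
    "t3": ("Hitchcock", "Stewart", "Thriller"),
    "t4": ("Hitchcock", "Grant", "Thriller"),
    "t5": ("Koster", "Grant", "Comedy"),
    "t6": ("Koster", "Stewart", "Comedy"),
}


def jaccard(t1, t2):
    """|t1 ∩ t2| / |t1 ∪ t2| over (attribute index, value) pairs."""
    s1, s2 = set(enumerate(t1)), set(enumerate(t2))
    return len(s1 & s2) / len(s1 | s2)


def similar_pairs(tuples, theta):
    """Pairs of tuple ids whose Jaccard similarity exceeds theta."""
    return {
        (a, b) for a, b in combinations(sorted(tuples), 2)
        if jaccard(tuples[a], tuples[b]) > theta
    }


links = similar_pairs(MOVIES, theta=0.3)
print(jaccard(MOVIES["t1"], MOVIES["t2"]), sorted(links))
```

Tuples that share two of three values have similarity 2/4 = 0.5, and tuples that share one value have similarity 1/5 = 0.2, so with θ = 0.3 only the three same-director or same-actor-and-genre pairs remain.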

COOLCAT Algorithm. The approach most similar to ours is the COOLCAT algorithm 
[3,4], by Barbara, Couto and Li. The COOLCAT algorithm is a scalable algorithm 
that optimizes the same objective function as our approach, namely the entropy of the 
clustering. It differs from our approach in that it relies on sampling, and it is 
non-hierarchical. COOLCAT starts with a sample of points and identifies a set of k initial 
tuples such that the minimum pairwise distance among them is maximized. These serve 
as representatives of the k clusters. All remaining tuples of the data set are placed in 
one of the clusters such that, at each step, the increase in the entropy of the resulting 
clustering is minimized. For the experiments, we implement COOLCAT based on the 
CIKM paper by Barbara et al. [4]. 

STIRR Algorithm. STIRR [12] applies a linear dynamical system over multiple copies 
of a hypergraph of weighted attribute values, until a fixed point is reached. Each copy 
of the hypergraph contains two groups of attribute values, one with positive and another 
with negative weights, which define the two clusters. We compare this algorithm 
with our intra-attribute value clustering algorithm. In our experiments, we use our own 
implementation and report results for ten iterations. 

LIMBO Algorithm. In addition to the space-bounded version of LIMBO as described 
in Section 3, we implemented LIMBO so that the accuracy of the summary model is 
controlled instead. If we wish to control the accuracy of the model, we use a threshold 
on the distance d(c,t) to determine whether to merge DCF(t) with DCF(c), thus 
controlling directly the information loss for merging tuple t with cluster c. The selection 
of an appropriate threshold value will necessarily be data dependent and we require 
an intuitive way of allowing a user to set this threshold. Within a data set, every tuple 
contributes, on “average”, I(A;T)/n to the mutual information I(A;T). We define the 
clustering threshold to be a multiple φ of this average and we denote the threshold by 
τ(φ). That is, τ(φ) = φ · I(A;T)/n. We can make a pass over the data, or use a sample 
of the data, to estimate I(A;T). Given a value for φ (0 ≤ φ ≪ n), if a merge incurs 
information loss more than φ times the “average” mutual information, then the new tuple 
is placed in a cluster by itself. In the extreme case φ = 0.0, we prohibit any information 
loss in our summary (this is equivalent to setting S = ∞ in the space-bounded version 
of LIMBO). We discuss the effect of φ in Section 5.4. 
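The threshold can be sketched as below, computing I(A;T) exactly in one pass over the movie example rather than estimating it from a sample. The function names are hypothetical, and the merge test (absorb the tuple when the loss does not exceed the threshold) is a direct reading of the rule above.

```python
from math import log2

MOVIES = {
    "t1": ("Scorsese", "De Niro", "Crime"),
    "t2": ("Coppola", "De Niro", "Crime"),
    "t3": ("Hitchcock", "Stewart", "Thriller"),
    "t4": ("Hitchcock", "Grant", "Thriller"),
    "t5": ("Koster", "Grant", "Comedy"),
    "t6": ("Koster", "Stewart", "Comedy"),
}
ROWS = {t: {(i, v): 1.0 / len(vals) for i, v in enumerate(vals)}
        for t, vals in MOVIES.items()}


def mutual_information_at(rows):
    """I(A;T) in bits with p(t) = 1/n, from the sparse rows p(A|t)."""
    n = len(rows)
    p_a = {}
    for row in rows.values():
        for a, x in row.items():
            p_a[a] = p_a.get(a, 0.0) + x / n
    return sum(x / n * log2(x / p_a[a]) for row in rows.values() for a, x in row.items())


def clustering_threshold(rows, phi):
    """tau(phi) = phi * I(A;T) / n."""
    return phi * mutual_information_at(rows) / len(rows)


def absorbs(loss, rows, phi):
    """True if a merge with the given information loss is allowed."""
    return loss <= clustering_threshold(rows, phi)


tau = clustering_threshold(ROWS, phi=1.0)
print(tau, absorbs(1 / 9, ROWS, 1.0), absorbs(1 / 3, ROWS, 1.0))
```

For the six movie tuples I(A;T) is about 1.696 bits, so τ(1.0) is about 0.283 bits: merging t1 with t2 (loss 1/9) is allowed, while merging t1 with t3 (loss 1/3) is not. With φ = 0 the threshold is zero and only lossless merges are accepted.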

To distinguish between the two versions of LIMBO, we shall refer to the space-bounded 
version as LIMBO_S and the accuracy-bounded as LIMBO_φ. Note that algorithmically 
only the merging decision in Phase 1 differs in the two versions, while all 
other phases remain the same for both LIMBO_S and LIMBO_φ. 

5.2 Data Sets 

We experimented with the following data sets. The first three have been previously used 
for the evaluation of the aforementioned algorithms [4,12,13]. The synthetic data sets 
are used both for quality comparison, and for our scalability evaluation. 

Congressional Votes. This relational data set was taken from the UCI Machine Learning
Repository. 4 It contains 435 tuples of votes from the U.S. Congressional Voting Record
of 1984. Each tuple is a congress-person’s vote on 16 issues and each vote is boolean, 
either YES or NO. Each congress-person is classified as either Republican or Democrat. 
There are a total of 168 Republicans and 267 Democrats. There are 288 missing values 
that we treat as separate values. 

Mushroom. The Mushroom relational data set also comes from the UCI Repository. 
It contains 8,124 tuples, each representing a mushroom characterized by 22 attributes, 
such as color, shape, odor, etc. The total number of distinct attribute values is 117. Each
mushroom is classified as either poisonous or edible. There are 4,208 edible and 3,916 
poisonous mushrooms in total. There are 2,480 missing values. 

Database and Theory Bibliography. This relational data set contains 8,000 tuples that 
represent research papers. About 3,000 of the tuples represent papers from database 
research and 5,000 tuples represent papers from theoretical computer science. Each 
tuple contains four attributes with values for the first Author, second Author,
Conference/Journal and the Year of publication. 5 We use this data to test our intra-attribute
clustering algorithm. 
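
The preprocessing that footnote 5 describes for this data set (copying the first author when there is no second author, and keeping only venues that appear at least 5 times) might look like the following sketch. The record layout and the key names `first_author`, `second_author` and `venue` are assumptions made for illustration, not the authors' schema.

```python
from collections import Counter


def preprocess_bibliography(records, min_venue_count=5):
    """Fill missing second authors, then drop records of rare venues.

    `records` is a list of dicts with the hypothetical keys
    first_author, second_author, venue and year.
    """
    filled = []
    for record in records:
        record = dict(record)  # copy so the input is not modified
        if not record.get("second_author"):
            record["second_author"] = record["first_author"]
        filled.append(record)
    venue_counts = Counter(r["venue"] for r in filled)
    return [r for r in filled if venue_counts[r["venue"]] >= min_venue_count]


papers = [{"first_author": "A", "second_author": None,
           "venue": "V", "year": 2000}] * 5
papers.append({"first_author": "B", "second_author": "C",
               "venue": "W", "year": 2001})
print(len(preprocess_bibliography(papers)))
```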

Synthetic Data Sets. We produce synthetic data sets using a data generator available on
the Web. 6 This generator offers a wide variety of options, in terms of the number of tuples,
attributes, and attribute domain sizes. We specify the number of classes in the data set by
the use of conjunctive rules of the form (Attr_1 = a_1 ∧ Attr_2 = a_2 ∧ ...) ⇒ Class = c_1.
The rules may involve an arbitrary number of attributes and attribute values. We name
these synthetic data sets by the prefix DS followed by the number of classes in the
data set, e.g., DS5 or DS10. The data sets contain 5,000 tuples, and 10 attributes, with
domain sizes between 20 and 40 for each attribute. Three attributes participate in the
rules the data generator uses to produce the class labels. Finally, these data sets have up
to 10% erroneously entered values. Additional larger synthetic data sets are described
in Section 5.6.

4 http://www.ics.uci.edu/~mlearn/MLRepository.html

5 Following the approach of Gibson et al. [12], if the second author does not exist, then the
name of the first author is copied instead. We also filter the data so that each conference/journal
appears at least 5 times.

6 http://www.datgen.com/

Web Data. This is a market-basket data set that consists of a collection of web pages. 
The pages were collected as described by Kleinberg [14]. A query is made to a search 
engine, and an initial set of web pages is retrieved. This set is augmented by including 
pages that point to, or are pointed to by pages in the set. Then, the links between the pages 
are discovered, and the underlying graph is constructed. Following the terminology of 
Kleinberg [14] we define a hub to be a page with non-zero out-degree, and an authority 
to be a page with non-zero in-degree. 

Our goal is to cluster the authorities in the graph. The set of tuples T is the set 
of authorities in the graph, while the set of attribute values A is the set of hubs. Each 
authority is expressed as a vector over the hubs that point to this authority. For our 
experiments, we use the data set used by Borodin et al. [5] for the “abortion” query. We 
applied a filtering step to ensure that each hub points to more than 10 authorities and each
authority is pointed to by more than 10 hubs. The data set contains 93 authorities related
to 102 hubs. 
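
A minimal sketch of such a filtering step, followed by the construction of the authority vectors, is given below. The text does not say how the filter was applied, so the repeated pruning until both degree conditions hold is an assumption, and the names `filter_links` and `authority_vectors` are hypothetical.

```python
from collections import defaultdict


def filter_links(links, min_degree=10):
    """Keep only links whose hub has out-degree > min_degree and whose
    authority has in-degree > min_degree, repeating until nothing changes."""
    edges = set(links)
    while True:
        out_deg = defaultdict(int)
        in_deg = defaultdict(int)
        for hub, authority in edges:
            out_deg[hub] += 1
            in_deg[authority] += 1
        kept = {(h, a) for h, a in edges
                if out_deg[h] > min_degree and in_deg[a] > min_degree}
        if kept == edges:
            return edges
        edges = kept


def authority_vectors(edges):
    """Express each authority as the set of hubs that point to it."""
    vectors = defaultdict(set)
    for hub, authority in edges:
        vectors[authority].add(hub)
    return dict(vectors)


toy_links = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"),
             ("h2", "a2"), ("h3", "a1")]
print(authority_vectors(filter_links(toy_links, min_degree=1)))
```

In the toy example the threshold is lowered to 1 so that the effect is visible on five links: hub h3 points to a single authority and is removed.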

We have also applied LIMBO on Software Reverse Engineering data sets with con- 
siderable benefits compared to other algorithms [2]. 



5.3 Quality Measures for Clustering 

Clustering quality lies in the eye of the beholder; determining the best clustering usually 
depends on subjective criteria. Consequently, we will use several quantitative measures 
of clustering performance. 

Information Loss (IL): We use the information loss, I(A;T) - I(A;C), to compare
clusterings. The lower the information loss, the better the clustering. For a clustering
with low information loss, given a cluster, we can predict the attribute values of the
tuples in the cluster with relatively high accuracy. We present IL as a percentage of the
initial mutual information lost after producing the desired number of clusters using each
algorithm.
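
As an illustration of how the IL percentage can be computed, the following sketch evaluates 100 · (I(A;T) - I(A;C)) / I(A;T) for a given assignment of tuples to clusters. It assumes p(t) = 1/n, p(a|t) uniform over the values of t, p(c) = |c|/n, and p(a|c) equal to the average of p(a|t) over the tuples in c. The function names are hypothetical, and the data must contain at least two distinct tuples so that I(A;T) is non-zero.

```python
import math
from collections import Counter, defaultdict


def mutual_information(weights, dists):
    """I(A;X) in bits for objects x with probability weights[x] and
    conditional distributions dists[x] = {a: p(a|x)}."""
    p_a = Counter()
    for w, dist in zip(weights, dists):
        for a, p in dist.items():
            p_a[a] += w * p
    return sum(w * p * math.log2(p / p_a[a])
               for w, dist in zip(weights, dists)
               for a, p in dist.items() if p > 0)


def information_loss_pct(tuples, labels):
    """Percentage of I(A;T) lost by the clustering given in `labels`."""
    n = len(tuples)
    tuple_dists = [{a: 1.0 / len(t) for a in t} for t in tuples]
    i_at = mutual_information([1.0 / n] * n, tuple_dists)
    members = defaultdict(list)
    for dist, label in zip(tuple_dists, labels):
        members[label].append(dist)
    weights, cluster_dists = [], []
    for dists in members.values():
        merged = Counter()
        for dist in dists:
            for a, p in dist.items():
                merged[a] += p / len(dists)
        weights.append(len(dists) / n)
        cluster_dists.append(dict(merged))
    i_ac = mutual_information(weights, cluster_dists)
    return 100.0 * (i_at - i_ac) / i_at


data = [("x1", "y1"), ("x1", "y1"), ("x2", "y2"), ("x2", "y2")]
print(information_loss_pct(data, [0, 0, 1, 1]))
print(information_loss_pct(data, [0, 0, 0, 0]))
```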

Category Utility (CU): Category utility [15] is defined as the difference between the
expected number of attribute values that can be correctly guessed given a clustering, and
the expected number of correct guesses with no such knowledge. CU depends only on
the partitioning of the attribute values by the corresponding clustering algorithm and,
thus, is a more objective measure. Let C be a clustering. If A_i is an attribute with values
v_ij, then CU is given by the following expression:

$$CU = \sum_{c \in C} \frac{|c|}{n} \sum_{i} \sum_{j} \left[ P(A_i = v_{ij} \mid c)^2 - P(A_i = v_{ij})^2 \right]$$

We present CU as an absolute value that should be compared to the CU values given
by other algorithms, for the same number of clusters, in order to assess the quality of a
specific algorithm.
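
The category utility definition can be sketched in Python as follows, assuming each cluster is weighted by |c|/n and that every tuple has the same number of attributes. The function name `category_utility` and the toy data are illustrative only.

```python
from collections import Counter, defaultdict


def category_utility(tuples, labels):
    """CU of the clustering `labels` over equal-length categorical tuples."""
    n = len(tuples)
    m = len(tuples[0])
    # Sum over all attributes and values of P(A_i = v_ij)^2 for the whole data set.
    base = 0.0
    for i in range(m):
        counts = Counter(t[i] for t in tuples)
        base += sum((cnt / n) ** 2 for cnt in counts.values())
    clusters = defaultdict(list)
    for t, label in zip(tuples, labels):
        clusters[label].append(t)
    cu = 0.0
    for members in clusters.values():
        size = len(members)
        # Sum over all attributes and values of P(A_i = v_ij | c)^2.
        within = 0.0
        for i in range(m):
            counts = Counter(t[i] for t in members)
            within += sum((cnt / size) ** 2 for cnt in counts.values())
        cu += (size / n) * (within - base)
    return cu


data = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "y")]
print(category_utility(data, [0, 0, 1, 1]))
print(category_utility(data, [0, 0, 0, 0]))
```

A clustering that separates the two kinds of tuples scores higher than the trivial single cluster, which scores zero.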

Many data sets commonly used in testing clustering algorithms include a variable 
that is hidden from the algorithm, and specifies the class with which each tuple is as- 
sociated. All data sets we consider include such a variable. This variable is not used 
by the clustering algorithms. While there is no guarantee that any given classification 
corresponds to an optimal clustering, it is nonetheless enlightening to compare cluster- 
ings with pre-specified classifications of tuples. To do this, we use the following quality 
measures. 

Min Classification Error (E_min): Assume that the tuples in T are already classified
into k classes G = {g_1, ..., g_k}, and let C denote a clustering of the tuples in T
into k clusters {c_1, ..., c_k} produced by a clustering algorithm. Consider a one-to-one
mapping, f, from classes to clusters, such that each class g_i is mapped to the cluster
f(g_i). The classification error of the mapping is defined as

$$E = \sum_{i=1}^{k} \left| g_i \cap \overline{f(g_i)} \right|$$

where $|g_i \cap \overline{f(g_i)}|$, with $\overline{f(g_i)}$ denoting the complement of cluster f(g_i), measures
the number of tuples in class g_i that received the wrong
label. The optimal mapping between clusters and classes is the one that minimizes
the classification error. We use E_min to denote the classification error of the optimal
mapping.
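
For small k, the optimal mapping can be found by trying every one-to-one assignment of classes to clusters, as in the sketch below. This brute-force search is an illustrative choice, not necessarily what the authors used; an assignment algorithm such as the Hungarian method would be preferable for larger k. The error is returned as a count of tuples, so dividing by the number of tuples gives a fraction.

```python
from itertools import permutations


def min_classification_error(classes, clusters):
    """Return (E_min, mapping), where mapping[i] is the index of the cluster
    assigned to class i. `classes` and `clusters` are lists of k sets of
    tuple identifiers."""
    k = len(classes)
    best_err, best_map = None, None
    for perm in permutations(range(k)):
        # Tuples of class i that fall outside cluster perm[i] are misclassified.
        err = sum(len(classes[i] - clusters[perm[i]]) for i in range(k))
        if best_err is None or err < best_err:
            best_err, best_map = err, perm
    return best_err, best_map


classes = [{1, 2, 3}, {4, 5, 6}]
clusters = [{3, 4, 5}, {1, 2, 6}]
print(min_classification_error(classes, clusters))
```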

Precision (P), Recall (R): Without loss of generality assume that the optimal mapping
assigns class g_i to cluster c_i. We define precision, P_i, and recall, R_i, for a cluster c_i,
1 ≤ i ≤ k, as follows.

$$P_i = \frac{|c_i \cap g_i|}{|c_i|} \quad \text{and} \quad R_i = \frac{|c_i \cap g_i|}{|g_i|}$$

P_i and R_i take values between 0 and 1 and, intuitively, P_i measures the accuracy with
which cluster c_i reproduces class g_i, while R_i measures the completeness with which c_i
reproduces class g_i. We define the precision and recall of the clustering as the weighted
average of the precision and recall of each cluster. More precisely

$$P = \sum_{i=1}^{k} \frac{|g_i|}{|T|} P_i \quad \text{and} \quad R = \sum_{i=1}^{k} \frac{|g_i|}{|T|} R_i$$

We think of precision, recall, and classification error as indicative values (percentages) 
of the ability of the algorithm to reconstruct the existing classes in the data set. 
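
A short sketch of the precision and recall computation follows. It assumes the classes and clusters are already aligned by the optimal mapping (class i corresponds to cluster i) and that each cluster is weighted by the size of its class, |g_i|/|T|. The function name `precision_recall` is hypothetical.

```python
def precision_recall(classes, clusters):
    """Weighted precision and recall of a clustering.

    classes[i] and clusters[i] are sets of tuple identifiers, aligned so
    that class i is mapped to cluster i."""
    total = sum(len(g) for g in classes)
    precision = 0.0
    recall = 0.0
    for g, c in zip(classes, clusters):
        overlap = len(g & c)
        p_i = overlap / len(c) if c else 0.0
        r_i = overlap / len(g) if g else 0.0
        weight = len(g) / total
        precision += weight * p_i
        recall += weight * r_i
    return precision, recall


classes = [{1, 2, 3, 4}, {5, 6}]
clusters = [{1, 2, 3}, {4, 5, 6}]
print(precision_recall(classes, clusters))
```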

In our experiments, we report values for all of the above measures. For LIMBO and 
COOLCAT, numbers are averages over 100 runs with different (random) orderings of 
the tuples. 
Fig. 2. LIMBO_S and LIMBO_φ execution times (DS5)



5.4 Quality-Efficiency Trade-Offs for LIMBO 

In LIMBO, we can control the size of the model (using S) or the accuracy of the model
(using φ). Both S and φ permit a trade-off between the expressiveness (information
preservation) of the summarization and the compactness of the model (number of leaf
entries in the tree) it produces. For large values of S and small values of φ, we obtain a
fine grain representation of the data set at the end of Phase 1. However, this results in a
tree with a large number of leaf entries, which leads to a higher computational cost for
both Phase 1 and Phase 2 of the algorithm. For small values of S and large values of φ,
we obtain a compact representation of the data set (small number of leaf entries), which
results in faster execution time, at the expense of increased information loss.

We now investigate this trade-off for a range of values for S and φ. We observed
experimentally that the branching factor B does not significantly affect the quality of
the clustering. We set B = 4, which results in manageable execution time for Phase 1.
Figure 2 presents the execution times for LIMBO_S and LIMBO_φ on the DS5 data set,
as a function of S and φ, respectively. For φ = 0.25 the Phase 2 time is 210 seconds
(beyond the edge of the graph). The figures also include the size of the tree in KBytes.
In this figure, we observe that for large S and small φ the computational bottleneck of
the algorithm is Phase 2. As S decreases and φ increases the time for Phase 2 decreases
in a quadratic fashion. This agrees with the plot in Figure 3, where we observe that the
number of leaves also decreases in a quadratic fashion. Due to the decrease in the size
(and height) of the tree, the time for Phase 1 also decreases, although at a much slower rate.
Phase 3, as expected, remains unaffected, and it is equal to a few seconds for all values
of S and φ. For S < 256KB and φ > 1.0 the number of leaf entries becomes sufficiently
small, so that the computational bottleneck of the algorithm becomes Phase 1. For these
values the execution time is dominated by the linear scan of the data in Phase 1.

We now study the change in the quality measures for the same range of values for
S and φ. In the extreme cases of S = ∞ and φ = 0.0, we only merge identical tuples,
and no information is lost in Phase 1. LIMBO then reduces to the AIB algorithm, and
we obtain the same quality as AIB. Figures 4 and 5 show the quality measures for the
different values of φ and S. The CU value (not plotted) is equal to 2.51 for S < 256KB,
and 2.56 for S > 256KB. We observe that for S > 256KB and φ < 1.0 we obtain
clusterings of exactly the same quality as for S = ∞ and φ = 0.0, that is, the AIB
algorithm. At the same time, for S = 256KB and φ = 1.0 the execution time of the
algorithm is only a small fraction of that of the AIB algorithm, which was a few minutes.

Fig. 3. LIMBO_φ Leaves (DS5)

Fig. 4. LIMBO_φ Quality (DS5)

Fig. 5. LIMBO_S Quality (DS5)

Similar trends were observed for all other data sets. There is a range of values for
S and φ, where the execution time of LIMBO is dominated by Phase 1, while at the
same time, we observe essentially no change (up to the third decimal digit) in the quality
of the clustering. Table 4 shows the reduction in the number of leaf entries for each
data set for LIMBO_S and LIMBO_φ. The parameters S and φ are set so that the cluster
quality is almost identical to that of AIB (as demonstrated in Table 6). These experiments
demonstrate that in Phase 1 we can obtain significant compression of the data sets at no
expense in the final quality. The consistency of LIMBO can be attributed in part to the
effect of Phase 3, which assigns the tuples to cluster representatives, and hides some of
the information loss incurred in the previous phases. Thus, it is sufficient for Phase 2
to discover k well separated representatives. As a result, even for large values of φ and
small values of S, LIMBO obtains essentially the same clustering quality as AIB, but in
linear time.



Table 4. Reduction in Leaf Entries

            Votes     Mushroom   DS5       DS10
LIMBO_S     85.94%    99.34%     95.36%    95.28%
LIMBO_φ     94.01%    99.77%     98.68%    98.82%



5.5 Comparative Evaluations 

In this section, we demonstrate that LIMBO produces clusterings of high quality, and 
we compare against other categorical clustering algorithms. 

Tuple Clustering. Table 5 shows the results for all algorithms on all quality measures
for the Votes and Mushroom data sets. For LIMBO_S, we present results for S = 128KB
while for LIMBO_φ, we present results for φ = 1.0. We can see that both versions of
LIMBO have results almost identical to the quality measures for S = ∞ and φ = 0.0,
i.e., the AIB algorithm. The size entry in the table holds the number of leaf entries for
LIMBO, and the sample size for COOLCAT. For the Votes data set, we use the whole
data set as a sample, while for Mushroom we use 1,000 tuples. As Table 5 indicates,
LIMBO’s quality is superior to that of ROCK and COOLCAT in both data sets. In terms of IL,
LIMBO created clusters which retained most of the initial information about the attribute
values. With respect to the other measures, LIMBO outperforms all other algorithms,
exhibiting the highest CU, P and R in all data sets tested, as well as the lowest E_min.

Table 5. Results for real data sets

Votes (2 clusters)
Algorithm               size     IL(%)   P      R      E_min   CU
LIMBO (φ=0.0, S=∞)      384      72.52   0.89   0.87   0.13    2.89
LIMBO_S (128KB)         54       72.54   0.89   0.87   0.13    2.89
LIMBO_φ (1.0)           23       72.55   0.89   0.87   0.13    2.89
COOLCAT (s = 435)       435      73.55   0.87   0.85   0.15    2.78
ROCK (θ = 0.7)          -        74.00   0.87   0.86   0.16    2.63

Mushroom (2 clusters)
Algorithm               size     IL(%)   P      R      E_min   CU
LIMBO (φ=0.0, S=∞)      8124     81.45   0.91   0.89   0.11    1.71
LIMBO_S (128KB)         54       81.46   0.91   0.89   0.11    1.71
LIMBO_φ (1.0)           18       81.45   0.91   0.89   0.11    1.71
COOLCAT (s = 1000)      1,000    84.57   0.76   0.73   0.27    1.46
ROCK (θ = 0.8)          -        86.00   0.77   0.57   0.43    0.59

Table 6. Results for synthetic data sets

DS5 (n=5000, 10 attributes, 5 clusters)
Algorithm               size     IL(%)   P       R       E_min   CU
LIMBO (φ=0.0, S=∞)      5000     77.56   0.998   0.998   0.002   2.56
LIMBO_S (1024KB)        232      77.57   0.998   0.998   0.002   2.56
LIMBO_φ (1.0)           66       77.56   0.998   0.998   0.002   2.56
COOLCAT (s = 125)       125      78.02   0.995   0.995   0.05    2.54
ROCK (θ = 0.0)          -        85.00   0.839   0.724   0.28    0.44

DS10 (n=5000, 10 attributes, 10 clusters)
Algorithm               size     IL(%)   P       R       E_min   CU
LIMBO (φ=0.0, S=∞)      5000     73.50   0.997   0.997   0.003   2.82
LIMBO_S (1024KB)        236      73.52   0.996   0.996   0.004   2.82
LIMBO_φ (1.0)           59       73.51   0.994   0.996   0.004   2.82
COOLCAT (s = 125)       125      74.32   0.979   0.973   0.026   2.74
ROCK (θ = 0.0)          -        78.00   0.830   0.818   0.182   2.13
We also evaluate LIMBO’s performance on two synthetic data sets, namely DS5 and 
DS10. These data sets allow us to evaluate our algorithm on data sets with more than 
two classes. The results are shown in Table 6. We observe again that LIMBO has the 
lowest information loss and produces nearly optimal results with respect to precision 
and recall. 

For the ROCK algorithm, we observed that it is very sensitive to the threshold value
θ and in many cases, the algorithm produces one giant cluster that includes tuples from
most classes. This results in poor precision and recall.

Comparison with COOLCAT. COOLCAT exhibits average clustering quality that is 
close to that of LIMBO. It is interesting to examine how COOLCAT behaves when we 
consider other statistics. In Table 7, we present statistics for 100 runs of COOLCAT 
and LIMBO on different orderings of the Votes and Mushroom data sets. We present 
LIMBO’s results for S = 128KB and φ = 1.0, which are very similar to those for
S = ∞. For the Votes data set, COOLCAT exhibits information loss as high as 95.31%
with a variance of 12.25%. For all runs, we use the whole data set as the sample for 
COOLCAT. For the Mushroom data set, the situation is better, but still the variance is 
as high as 3.5%. The sample size was 1,000 for all runs. Table 7 indicates that LIMBO 
behaves in a more stable fashion over different runs (that is, different input orders). 
Table 7. Statistics for IL(%) and CU

VOTES
Algorithm              Measure   Min     Max     Avg     Var
LIMBO (S = 128KB)      IL        71.98   73.68   72.54   0.08
                       CU        2.80    2.93    2.89    (illegible)
LIMBO (φ = 1.0)        IL        71.98   73.29   72.55   (illegible)
                       CU        2.83    2.94    2.89    (illegible)
COOLCAT (s = 435)      IL        71.99   95.31   73.55   12.25
                       CU        0.19    2.94    2.78    0.15

MUSHROOM
Algorithm              Measure   Min     Max     Avg     Var
LIMBO (S = 1024KB)     IL        81.46   81.46   81.46   0.00
                       CU        1.71    1.71    1.71    0.00
LIMBO (φ = 1.0)        IL        81.45   81.45   81.45   0.00
                       CU        1.71    1.71    1.71    0.00
COOLCAT (s = 1000)     IL        81.60   87.07   84.57   3.50
                       CU        0.80    1.73    1.46    0.05

Notably, for the Mushroom data set, LIMBO’s performance is exactly the same in all 
runs, while for Votes it exhibits a very low variance. This indicates that LIMBO is not 
particularly sensitive to the input order of data. 

The performance of COOLCAT appears to be sensitive to the following factors: the 
choice of representatives, the sample size, and the ordering of the tuples. After detailed 
examination we found that the runs with maximum information loss for the Votes data 
set correspond to cases where an outlier was selected as the initial representative. The 
Votes data set contains three such tuples, which are far from all other tuples, and they are 
naturally picked as representatives. Reducing the sample size decreases the probability
of selecting outliers as representatives; however, it increases the probability of missing
one of the clusters. In this case, high information loss may occur if COOLCAT picks
as representatives two tuples that are not maximally far apart. Finally, there are cases
where the same representatives may produce different results. As tuples are inserted into
the clusters, the representatives “move” closer to the inserted tuples, thus making the
algorithm sensitive to the ordering of the data set.

In terms of computational complexity both LIMBO and COOLCAT include a stage 
that requires quadratic complexity. For LIMBO this is Phase 2. For COOLCAT, this is 
the step where all pairwise entropies between the tuples in the sample are computed. We 
experimented with both algorithms having the same input size for this phase, i.e., we
made the sample size of COOLCAT equal to the number of leaves for LIMBO. Results
for the Votes and Mushroom data sets are shown in Tables 8 and 9. LIMBO outperforms 
COOLCAT in all runs, for all quality measures even though execution time is essentially 
the same for both algorithms. The two algorithms are closest in quality for the Votes data 
set with input size 27, and farthest apart for the Mushroom data set with input size 275. 
COOLCAT appears to perform better with smaller sample size, while LIMBO remains 
essentially unaffected. 



Table 8. LIMBO vs COOLCAT on Votes

Sample Size = Leaf Entries = 384
Algorithm   IL(%)   P      R      E_min   CU
LIMBO       72.52   0.89   0.87   0.13    2.89
COOLCAT     74.15   0.86   0.84   0.15    2.63

Sample Size = Leaf Entries = 27
Algorithm   IL(%)   P      R      E_min   CU
LIMBO       72.55   0.89   0.87   0.13    2.89
COOLCAT     73.50   0.88   0.86   0.13    2.87

Table 9. LIMBO vs COOLCAT on Mushroom

Sample Size = Leaf Entries = 275
Algorithm   IL(%)   P      R      E_min   CU
LIMBO       81.45   0.91   0.89   0.11    1.71
COOLCAT     83.50   0.76   0.73   0.27    1.46

Sample Size = Leaf Entries = 18
Algorithm   IL(%)   P      R      E_min   CU
LIMBO       81.45   0.91   0.89   0.11    1.71
COOLCAT     82.10   0.82   0.81   0.19    1.60

Web Data. Since this data set has no predetermined cluster labels, we use a different
evaluation approach. We applied LIMBO with φ = 0.0 and clustered the authorities into
three clusters. (Due to lack of space the choice of k is discussed in detail in [1].) The total
information loss was 61%. Figure 6 shows the authority to hub table, after permuting
the rows so that we group together authorities in the same cluster, and the columns so
that each hub is assigned to the cluster to which it has the most links.

LIMBO accurately characterizes the structure of the web graph. Authorities are clustered
in three distinct clusters. Authorities in the same cluster share many hubs, while
those in different clusters have very few hubs in common. The three different clusters
correspond to different viewpoints on the issue of abortion. The first cluster consists of
“pro-choice” pages. The second cluster consists of “pro-life” pages. The third cluster
contains a set of pages from cincinnati.com that were included in the data set by
the algorithm that collects the web pages [5], despite having no apparent relation to the
abortion query. A complete list of the results can be found in [1]. 7



Intra-Attribute Value Clustering. We now present results for the application of 
LIMBO to the problem of intra-attribute value clustering. For this experiment, we use 
the Bibliographic data set. We are interested in clustering the conferences and journals, 
as well as the first authors of the papers. We compare LIMBO with STIRR, an algorithm 
for clustering attribute values. 

Following the description of Section 4, for the first experiment we set the random 
variable A' to range over the conferences/journals, while variable A ranges over first 
and second authors, and the year of publication. There are 1,211 distinct venues in the 
data set; 815 are database venues, and 396 are theory venues. 8 Results for S = 5MB
and φ = 1.0 are shown in Table 10. LIMBO’s results are superior to those of STIRR
with respect to all quality measures. The difference is especially pronounced in the P 
and R measures. 



Table 10. Bib clustering using LIMBO & STIRR

Algorithm          Leaves   IL(%)   P      R      E_min
LIMBO (S = 5MB)    16       94.02   0.90   0.89   0.12
LIMBO (φ = 1.0)    47       94.01   0.90   0.90   0.11
STIRR              -        98.01   0.56   0.55   0.45


We now turn to the problem of clustering the first authors. Variable A' ranges over the
set of 1,416 distinct first authors in the data set, and variable A ranges over the rest of the
attributes. We produce two clusters, and we evaluate the results of LIMBO and STIRR
based on the distribution of the papers that were written by first authors in each cluster.
Figures 7 and 8 illustrate the clusters produced by LIMBO and STIRR, respectively.
The x-axis in both figures represents publishing venues while the y-axis represents first
authors. If an author has published a paper in a particular venue, this is represented by a
point in each figure. The thick horizontal line separates the clusters of authors, and the
thick vertical line distinguishes between theory and database venues. Database venues
lie on the left of the line, while theory ones on the right of the line.

7 Available at: http://www.cs.toronto.edu/~periklis/pubs/csrg467.pdf

8 The data set is pre-classified, so class labels are known.

Fig. 6. Web data clusters

Fig. 7. LIMBO clusters

Fig. 8. STIRR clusters

From these figures, it is apparent that LIMBO yields a better partition of the authors 
than STIRR. The upper half corresponds to a set of theory researchers with almost no 
publications in database venues. The bottom half corresponds to a set of database
researchers with very few publications in theory venues. Our clustering is slightly smudged
by the authors between index 400 and 450 that appear to have a number of publications 
in theory. These are drawn in the database cluster due to their co-authors. STIRR, on 
the other hand, creates a well separated theory cluster (upper half), but the second clus- 
ter contains authors with publications almost equally distributed between theory and 
database venues. 



5.6 Scalability Evaluation 

In this section, we study the scalability of LIMBO, and we investigate how the parameters
affect its execution time. We study the execution time of both LIMBO_S and LIMBO_φ.
We consider four data sets of size 500K, 1M, 5M, and 10M, each containing 10 clusters
and 10 attributes with 20 to 40 values each. The first three data sets are samples of the
10M data set.

For LIMBO_S, the size and the number of leaf entries of the DCF tree at the end of
Phase 1 is controlled by the parameter S. For LIMBO_φ, we study Phase 1 in detail. As we
vary φ, Figure 9 demonstrates that the execution time for Phase 1 decreases at a steady
rate for values of φ up to 1.0. For 1.0 < φ < 1.5, execution time drops significantly.
This decrease is due to the reduced number of splits and the decrease in the DCF tree
size. In the same plot, we show some indicative sizes of the tree demonstrating that the
vectors that we maintain remain relatively sparse. The average density of the DCF tree
vectors, i.e., the average fraction of non-zero entries, remains between 41% and 87%.
Fig. 9. Phase 1 execution times

Fig. 10. Phase 1 leaf entries

Figure 10 plots the number of leaves as a function of φ. 9 We observe that for the same
range of values for φ (1.0 < φ < 1.5), LIMBO produces a manageable DCF tree, with
a small number of leaves, leading to fast execution time in Phase 2. Furthermore, in all
our experiments the height of the tree was never more than 11, and the occupancy of the
tree, i.e., the number of occupied entries over the total possible number of entries, was
always above 85.7%, indicating that the memory space was well used.

Thus, for 1.0 < φ < 1.5, we have a DCF tree with manageable size, and fast
execution time for Phase 1 and 2. For our experiments, we set φ = 1.2 and φ = 1.3.
For LIMBO_S, we use buffer sizes of S = 1MB and S = 5MB. We now study the
total execution time of the algorithm for these parameter values. The graph in Figure 11
shows the execution time for LIMBO_S and LIMBO_φ on the data sets we consider. In
this figure, we observe that execution time scales in a linear fashion with respect to the
size of the data set for both versions of LIMBO. We also observed that the clustering
quality remained unaffected for all values of S and φ, and it was the same across the
data sets (except for IL in the 1M data set, which differed by 0.01%). Precision (P) and
Recall (R) were 0.999, and the classification error (E_min) was 0.0013, indicating that
LIMBO can produce clusterings of high quality, even for large data sets.

In our next experiment, we varied the number of attributes, m, in the 5M and 10M
data sets and ran both LIMBO_S, with a buffer size of 5MB, and LIMBO_φ, with φ = 1.2.
Figure 12 shows the execution time as a function of the number of attributes, for different
data set sizes. In all cases, execution time increased linearly. Table 11 presents the
quality results for all values of m for both LIMBO algorithms. The quality measures are
essentially the same for different sizes of the data set.

Finally, we varied the number of clusters from k = 10 up to k = 50 in the 10M data
set, for S = 5MB and φ = 1.2. As expected from the analysis of LIMBO in Section 3.4,
the number of clusters affected only Phase 3. Recall from Figure 2 in Section 5.4 that
Phase 3 is a small fraction of the total execution time. Indeed, as we increase k from 10
to 50, we observed just 2.5% increase in the execution time for LIMBO_S and just 1.1%
for LIMBO_φ.



9 The y-axis of Figure 10 has a logarithmic scale. 
Fig. 11. Execution time (m=10)

Fig. 12. Execution time

Table 11. LIMBO_S and LIMBO_φ quality

LIMBO_φ,S   IL(%)   P       R       E_min    CU
m = 5       49.12   0.991   0.991   0.0013   2.52
m = 10      60.79   0.999   0.999   0.0013   3.87
m = 20      52.01   0.997   0.994   0.0015   4.56

6 Other Related Work 

CACTUS [10], by Ganti, Gehrke and Ramakrishnan, uses summaries of information
constructed from the data set that are sufficient for discovering clusters. The algorithm
defines attribute value clusters with overlapping cluster-projections on any attribute. This
makes the assignment of tuples to clusters unclear.

Our approach is based on the Information Bottleneck (IB) method, introduced by
Tishby, Pereira and Bialek [20]. The Information Bottleneck method has been used in
an agglomerative hierarchical clustering algorithm [18] and applied to the clustering of
documents [19]. Recently, Slonim and Tishby [17] introduced the sequential Information
Bottleneck (sIB) algorithm, which reduces the running time relative to the agglomerative
approach. However, it depends on an initial random partition and requires multiple passes
over the data for different initial partitions. In the future, we plan to experiment with sIB
in Phase 2 of LIMBO.

Finally, an algorithm that uses an extension to BIRCH [21] is given by Chiu, Fang,
Chen, Wang and Jeris [6]. Their approach assumes that the data follows a multivariate
normal distribution. The performance of the algorithm has not been tested on categorical
data sets.



7 Conclusions and Future Directions 

We have evaluated the effectiveness of LIMBO in trading off either quality for time or 
quality for space to achieve compact, yet accurate, models for small and large categorical 
data sets. We have shown LIMBO to have advantages over other information theoretic 
clustering algorithms including AIB (in terms of scalability) and COOLCAT (in terms
of clustering quality and parameter stability). We have also shown advantages in quality
over other scalable and non-scalable algorithms designed to cluster either categorical
tuples or values. With our space-bounded version of LIMBO (LIMBO_S), we can build
a model in one pass over the data in a fixed amount of memory while still effectively
controlling information loss in the model. These properties make LIMBO_S amenable for
use in clustering streaming categorical data [8]. In addition, to the best of our knowledge,
LIMBO is the only scalable categorical algorithm that is hierarchical. Using its compact 
summary model, LIMBO efficiently builds clusterings for not just a single value of k, 
but for a large range of values (typically hundreds). Furthermore, we are also able to 
produce statistics that let us directly compare clusterings. We are currently formalizing 
the use of such statistics in determining good values for k. Finally, we plan to apply 
LIMBO as a data mining technique to schema discovery [16]. 
Abstract. With the widespread use of e-business coupled with the public’s
awareness of data privacy issues and recent database security related
legislations, incorporating security features into modern database prod- 
ucts has become an increasingly important topic. Several database ven- 
dors already offer integrated solutions that provide data privacy within 
existing products. However, treating security and privacy issues as an 
afterthought often results in inefficient implementations. Some notable 
RDBMS storage models (such as the N-ary Storage Model) suffer from 
this problem. In this work, we analyze issues in storage security and 
discuss a number of trade-offs between security and efficiency. We then 
propose a new secure storage model and a key management architecture 
which enable efficient cryptographic operations while maintaining a very 
high level of security. We also assess the performance of our proposed 
model by experimenting with a prototype implementation based on the 
well-known TPC-H data set. 



1 Introduction 

Recently intensified concerns about security and privacy of data have prompted 
new legislation and fueled the development of new industry standards. These 
include the Gramm-Leach-Bliley Act (also known as the Financial Modernization 
Act) [3] that protects personal financial information, and the Health Insurance 
Portability and Accountability Act (HIPAA) [4] that regulates the privacy of 
personal health care information. 

Basically, the new legislation requires anyone storing sensitive data to do so 
in encrypted fashion. As a result, database vendors are working towards offering 
security- and privacy-preserving solutions in their product offerings. Two promi- 
nent examples are Oracle [2] and IBM DB2 [5]. Despite its importance, little can 
be found on this topic in the research literature, with the exception of [6], [7] 
and [8]. 

Designing an effective security solution requires, among other things, under- 
standing the points of vulnerability and the attack models. Important issues 
include: (1) selection of encryption function(s), (2) key management architecture, 
and (3) data encryption granularity. The main challenge is to introduce 
security functionality without incurring too much overhead, in terms of both 
performance and storage. The problem is further exacerbated since stored data 
may comprise both sensitive as well as non-sensitive components and access to 
the latter should not be degraded simply because the former must be protected. 

In this paper, we argue that adding privacy as an afterthought results in 
suboptimal performance. Efficient privacy measures require fundamental changes 
to the underlying storage subsystem implementation. We propose such a storage 
model and develop appropriate key management techniques which minimize the 
possibility of key and data compromise. More concretely, our main contribution 
is a new secure DBMS storage model that facilitates efficient implementation. 
Our approach involves grouping sensitive data, in order to minimize the number 
of necessary encryption operations, thus lowering cryptographic overhead. 

Model: We assume a client-server scenario. The client has a combination of sen- 
sitive and non-sensitive data stored in a database at the server, with the sensitive 
data stored in encrypted form. Whether or not the two parties are co-located 
does not make a difference in terms of security. The server’s added responsibility 
is to protect the client’s sensitive data, i.e., to ensure its confidentiality and 
prevent unauthorized access. (Note that maintaining availability and integrity of 
stored data is an entirely different requirement.) This is accomplished through 
the combination of encryption, authentication and access control. 

Trust in Server: The level of trust in the database server can range from fully 
trusted to fully untrusted, with several intermediate points. In a fully untrusted 
model, the server is not trusted with the client’s cleartext data which it stores. 
(It may still be trusted with data integrity and availability.) In a fully 
trusted model, by contrast, the server essentially acts as remote (outsourced) 
database storage for its clients. 

Our focus is on environments where the server is partially trusted. We consider 
the extreme of a fully trusted server neither general nor particularly 
challenging. The other extreme of a fully untrusted server corresponds to the 
so-called “Database-as-a-Service” (DAS) model [9]. In this model, a client does 
not even trust the server with cleartext queries; hence, it involves the server 
performing encrypted queries over encrypted data. The DAS model is interesting in its 
own right and presents a number of challenges. However, it also significantly 
complicates query processing at both client and server sides. 



1.1 Potential Vulnerabilities 

Our model has two major points of vulnerability with respect to client’s data: 

— Client-Server Communication: Assuming that client and server are not 
co-located, it is vital to secure their communication since client queries can 
involve sensitive inputs and server’s replies carry confidential information. 
— Stored Data: Typically, DBMSs protect stored data through access control 
mechanisms. However, as mentioned above, this is insufficient, since the server’s 
secondary storage might not be constantly trusted and, at the very least, 
sensitive data should be stored in encrypted form. 

All client-server communication can be secured through standard means, e.g., 
an SSL connection, which is the current de facto standard for securing Internet 
communication. Therefore, communication security poses no real challenge and 
we ignore it in the remainder of this paper. With regard to the stored data secu- 
rity, although access control has proven to be very useful in today’s databases, 
its goals should not be confused with those of data confidentiality. Our model 
assumes that access control measures may be circumvented, e.g., by bulk copying 
of the server’s secondary storage. Somewhat surprisingly, there is a dearth of 
prior work on the subject of incorporating cryptographic techniques into 
databases, especially with an emphasis on efficiency. For this reason, our goal is to come 
up with a database storage model that allows for efficient implementation of 
encryption techniques and, at the same time, protects against certain attacks 
described in the next section. 

1.2 Security and Attack Models 

In our security model, the server’s memory is trusted, which means that an 
adversary cannot gain access to data currently in memory, e.g., by performing 
a memory dump. Thus, we focus on protecting secondary storage which, in this 
model, can be compromised. In particular, we need to ensure that an adversary 
who can access (physically or otherwise) the server’s secondary storage is 
unable to learn anything about the actual sensitive data. 

Although it seems that, mechanically, data confidentiality is fairly easy to 
obtain in this model, it turns out not to be a trivial task. This is chiefly because 
incorporating encryption into existing databases (which are based on today’s 
storage models) is difficult without significant degradation in the overall system 
performance. 

Organization: The rest of the paper is organized as follows. Section 2 reviews 
related work and discusses, in detail, the problem we are trying to solve. Section 
3 deals with certain aspects of database encryption, currently offered solutions 
and their limitations. Section 4 outlines the new DBMS storage model. This 
section also discusses encryption of indexes and other database-related opera- 
tions affected by the proposed model. Section 5 consists of experiments with 
our prototype implementation of the new model. The paper concludes with the 
summary and directions for future work in section 6. 



2 Background 

Incorporating encryption into databases seems to be a fairly recent development 
among industry database providers [2] [5], and not much research has been 
devoted to this subject in terms of efficient implementation models. A nice survey 
of techniques used by modern database providers can be found in [10]. 

Some recent research focused on providing database as a service (DAS) in an 
untrusted server model [7] [9]. Some of this work dealt with analyzing how data 
can be stored securely at the server so as to allow a client to execute SQL queries 
directly over encrypted tuples. As for trusted server models, one approach that 
has been investigated involves the use of tamper-resistant hardware (smart card 
technology) to perform encryption at the server side [8]. 



2.1 Problems 

The incorporation of encryption within modern DBMSs has often been incomplete, 
as several important factors have been neglected. They are as follows: 

Performance Penalty: Added security measures typically introduce significant 
computational overhead to the running time of general database operations. This 
performance penalty is due mainly to the underlying storage models. It seems 
difficult to find an efficient encryption scheme for current database products 
without modifying the way in which records are stored in blocks on disk. The 
performance overhead caused by adding encryption has been demonstrated in [10], 
which compares queries run on several pairs of otherwise identical data sets, 
one of which contains encrypted information while the other does not. 

Inflexibility: Depending on the encryption granularity, it might not be feasible 
to separate sensitive from non-sensitive fields when encrypting. For example, if 
row level encryption is used and only one out of several attributes needs to be 
kept confidential, a considerable amount of computational overhead would be 
incurred due to unnecessary encryption and decryption of all other attributes. 
Obviously, the finer the encryption granularity, the more flexibility is gained 
in terms of selecting the specific attributes to encrypt. (See section 3.2 for a 
discussion of different levels of encryption granularity.) 

Metadata Files: Many vendors seem content with being able to claim the ability 
to offer “security” along with their database products. Some of these provide an 
incomplete solution by only allowing for the encryption of actual records, while 
ignoring metadata and log files which can be used to reveal sensitive fields. 

Unprotected Indexes: Some vendors do not permit encryption of indexes, 
while others allow users to build indexes based on encrypted values. The latter 
approach loses one of the most obvious capabilities of an index, range search, 
since a typical encryption algorithm is not order-preserving. Conversely, if an 
index built on a sensitive attribute, such as a U.S. Social Security Number, is 
left unencrypted, record encryption becomes meaningless. (Index encryption is 
discussed in detail in section 4.6.) 

3 Database Encryption 

There are two well-known classes of encryption algorithms: conventional and 
public-key. Although both can be used to provide data confidentiality, their goals 
and performance differ widely. Conventional (also known as symmetric-key) 
encryption algorithms require the encryptor and decryptor to share the same 
key. Such algorithms can achieve high bulk encryption speeds, on the order of 
hundreds of Mbits/sec. However, they suffer from the problem of secure key 
distribution, i.e., the need to securely deliver the same key to all necessary entities. 

Public-key cryptography solves the problem of key distribution by allowing 
an entity to create its own public/private key-pair. Anyone with the knowledge 
of an entity’s public key can encrypt data for this entity, while only someone 
in possession of the corresponding private key can decrypt the respective data. 
While elegant and useful, public key cryptography typically suffers from slow 
encryption speeds (up to 3 orders of magnitude slower than conventional algo- 
rithms) as well as secure public key distribution and revocation issues. 

To take advantage of their respective benefits and, at the same time, to avoid 
their drawbacks, it is usual to bootstrap secure communication by having the parties 
use a public-key algorithm (e.g., RSA [11]) to agree upon a secret key, which is 
then used to secure all subsequent transmission via some efficient conventional 
encryption algorithm, such as AES [12]. 

Due to their clearly superior performance, we use symmetric-key algorithms 
for encryption of data stored at the server. We also note that our particular 
model does not warrant using public key encryption at all. 

3.1 Encryption Modes and Their Side-Effects 

A typical conventional encryption algorithm offers several modes of operation. 
They can be broadly classified as block or stream cipher modes. 

Stream ciphers involve creating a key-stream based on a fixed key (and, 
optionally, counter, previous ciphertext, or previous plaintext) and combining 
it with the plaintext in some way (e.g., by XOR-ing them) to obtain ciphertext. 
Decryption involves reversing the process: combining the key-stream with the 
ciphertext to obtain the original plaintext. Along with the initial encryption key, 
additional state information must be maintained (i.e., key-stream initialization 
parameters) so that the key-stream can be re-created for decryption at a later 
time. 
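
As a rough illustration of the stream-cipher description above, the following 
Python sketch derives a key-stream from a fixed key plus a per-message nonce and 
XORs it with the plaintext. The key-stream generator (SHA-256 over key, nonce, 
and a counter) and the names keystream and xor_stream are illustrative 
assumptions of ours; this is a toy construction, not a vetted cipher and not 
the one used in the prototype. 

```python
import hashlib

def keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy key-stream: SHA-256(key || nonce || counter) blocks, concatenated."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])

def xor_stream(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Encrypt or decrypt: XOR-ing with the same key-stream is its own inverse."""
    ks = keystream(key, nonce, len(data))
    return bytes(a ^ b for a, b in zip(data, ks))

key = b"demo-key"
nonce = b"page-0007"   # state that must be kept so the key-stream can be re-created
ciphertext = xor_stream(key, nonce, b"Salary=52000")
recovered = xor_stream(key, nonce, ciphertext)
```

The nonce plays the role of the "additional state information" mentioned above: 
without it, the same key-stream cannot be regenerated for decryption. 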

Block ciphers take as input a sequence of fixed-size plaintext blocks (e.g., 
128-bit blocks in AES) and output the corresponding ciphertext block sequence. 
It is usually necessary to pad the plaintext before encryption in order to have 
it align with the desired block size. This can cause some overhead in terms 
of storage space, resulting in some data expansion. The cipher block chaining 
(CBC) mode is a blend of block and stream modes; in it, a sequence of input 
plaintext blocks is encrypted such that each ciphertext block depends on all 
preceding ciphertext blocks and, conversely, influences all subsequent ciphertext 
blocks. 
We use a block cipher in the CBC mode. Reasons for choosing block (over 
stream) ciphers include the added complexity of implementing stream ciphers, 
specifically, avoiding re-use of key-streams. This complexity stems from the dy- 
namic nature of the stored data: the contents of data pages may be updated fre- 
quently, requiring the use of a new key-stream. In order to remedy this problem, 
a certain amount of state would be needed to help create appropriate distinct 
key-streams whenever stored data is modified. 
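
The re-use hazard can be made concrete with a small sketch. If a page is 
updated and re-encrypted under the same key-stream, XOR-ing the old and new 
ciphertexts cancels the key-stream and exposes the XOR of the two plaintexts. 
The fixed key-stream and the sample values below are arbitrary stand-ins chosen 
for illustration only. 

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length strings."""
    return bytes(x ^ y for x, y in zip(a, b))

ks = bytes(range(40, 52))          # stand-in 12-byte key-stream, re-used twice
old_plain = b"Salary=52000"
new_plain = b"Salary=61000"        # the same field after an update

old_cipher = xor_bytes(old_plain, ks)
new_cipher = xor_bytes(new_plain, ks)

# An adversary holding both ciphertexts never sees ks, yet learns p1 XOR p2:
leak = xor_bytes(old_cipher, new_cipher)
```

Zero bytes in leak mark positions where the two versions agree, so an observer 
of secondary storage learns which part of the value changed. 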



3.2 Encryption Granularity 

Encryption can be performed at various levels of granularity. In general, finer 
encryption granularity affords more flexibility in allowing the server to choose 
what data to encrypt. This is important since stored data may include non- 
sensitive fields which, ideally, should not be encrypted (if for no other reason 
than to reduce overhead). The obvious encryption granularity choices are: 

— Attribute value: smallest achievable granularity; each attribute value of a 
tuple is encrypted separately. 

— Record/row: each row in a table is encrypted separately. This way, if only 
certain tuples need to be retrieved and their locations in storage are known, 
the entire table need not be decrypted. 

— Attribute/column: a more selective approach whereby only certain sensi- 
tive attributes (e.g., credit card numbers) are encrypted. 

— Page/block: this approach is geared for automating the encryption process. 
Whenever a page/block of sensitive data is stored on disk, the entire block is 
encrypted. One such block might contain one or multiple tuples, depending 
on the number of tuples fitting into a page (a typical page is 16 Kbytes). 

As mentioned above, we need to avoid encrypting non-sensitive data. If a record 
contains only a few sensitive fields, it would be wasteful to use row- or page-level 
encryption. However, if the entire table must be encrypted, it would be 
advantageous to work at the page level. This is because encrypting a few large 
pieces of data is considerably more efficient than encrypting many smaller pieces. 
Indeed, this is supported by our experimental results in section 3.6. 



3.3 Key Management 

Key management is clearly a very important aspect of any secure storage model. 
We use a simple key management scheme based on a two-level hierarchy consist- 
ing of a single master key and multiple sub-keys. Sub-keys are associated with 
individual tables or pages and are used to encrypt the data therein. Generation 
of all keys is the responsibility of the database server. Each sub-key is encrypted 
under the master key. Certain precautions need to be taken in the event that 
the master key is (or is believed to be) compromised. In particular, re-keying 
strategies must be specified. 

3.4 Re-keying and Re-encryption 

There are two types of re-keying: periodic and emergency. The former is needed 
since it is generally considered good practice to periodically change data 
encryption keys, especially for data stored over the long term. Folklore has it 
that the benefit of periodic re-keying is to prevent potential key compromise. 
However, this is not the case in our setting, since an adversary can always copy the 
encrypted database from untrusted secondary storage and compromise keys at 
some (much) later point, via, e.g., a brute-force attack. 

Emergency re-keying is done whenever key compromise is suspected or ex- 
pected. For example, if a trusted employee (e.g., a DBA) who has access to 
encryption keys is about to be fired or re-assigned, the risk of this employee 
mis-using the keys must be considered. Consequently, to prevent potential com- 
promise, all affected keys should be changed before (or at the time of) employee 
termination or re-assignment. 

3.5 Key Storage 

Clearly, where and how the master key is stored influences the overall security 
of the system. The master key needs to be in the possession of the DBA, stored on 
a smart card or some other hardware device or token. Presumably, this device 
is somehow “connected” to the database server during normal operation. How- 
ever, it is then possible for a DBA to abscond with the master key or somehow 
leak it. This should trigger emergency re-keying, whereby a new master key is 
created and all keys previously encrypted under the old master key are updated 
accordingly. 
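
The two-level hierarchy and the emergency re-keying step can be sketched as 
follows. This is a minimal illustration under our own assumptions: the wrap 
function is a toy XOR with a pad derived from the master key (not a secure key 
wrap), the table labels are hypothetical, and we read "updated accordingly" as 
re-wrapping the existing sub-keys under the new master key. 

```python
import hashlib
import secrets

def wrap(master: bytes, label: str, sub_key: bytes) -> bytes:
    """Toy key wrap (NOT secure): XOR the sub-key with a master-derived pad."""
    pad = hashlib.sha256(master + label.encode()).digest()[:len(sub_key)]
    return bytes(a ^ b for a, b in zip(sub_key, pad))

unwrap = wrap  # XOR-ing with the same pad undoes the wrap

def rekey_master(old_master: bytes, wrapped: dict) -> tuple:
    """Emergency re-keying: create a new master key and re-wrap every sub-key."""
    new_master = secrets.token_bytes(32)
    rewrapped = {
        label: wrap(new_master, label, unwrap(old_master, label, blob))
        for label, blob in wrapped.items()
    }
    return new_master, rewrapped

master = secrets.token_bytes(32)
sub_keys = {"employee": secrets.token_bytes(16), "payroll": secrets.token_bytes(16)}
stored = {label: wrap(master, label, k) for label, k in sub_keys.items()}

new_master, stored_after = rekey_master(master, stored)
```

Under this reading, a master-key change touches only the small set of wrapped 
sub-keys and leaves the encrypted data pages alone. If the sub-keys themselves 
are suspected to be exposed, they would instead have to be replaced and the 
affected data re-encrypted. 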



3.6 Encryption Costs 

Advances in general processor and DSP design continuously yield faster en- 
cryption speeds. However, even though bulk encryption rates can be very high, 
there remains a constant start-up cost associated with each encryption operation. 
(This cost is especially noticeable when keys are changed between encryptions 
since many ciphers require computing a key schedule before actually performing 
encryption.) The start-up cost dominates overall processing time when small 
amounts of data are encrypted, e.g., individual records or attribute values. 

Experiments: Recall our earlier claim that encrypting the same amount of 
data using few encryption operations with large data units is more efficient than 
many operations with small data units. Although this claim is quite intuitive, 
we still elected to run an experiment to support it. The experiment consisted of 
encrypting 10 Mbytes using both large and small unit sizes: blocks of 100-, 120-, 
and 16K-bytes. The two smaller data units represent average sizes for records 
in the TPC-H data set [13], while the last unit of 16-Kbytes was chosen as it is 
the default page size used in MySQL’s InnoDB table type. We used MySQL to 
implement our proposed storage model; the details can be found in section 4. 
Table 1. Many small vs. few large data blocks encrypted. All times in msec include 
initialization and encryption cost. 

Encryption Alg    100 Bytes * 100,000    120 Bytes * 83,333    16 KBytes * 625 
AES               365                    334                   194 
DES               372                    354                   229 
Blowfish          5280                   4409                  170 


We then performed the following operations: 100,000 encryptions of the 100-byte 
unit, 83,333 encryptions of the 120-byte unit, and 625 encryptions of the 
16-Kbyte unit. Our hardware platform was a Linux box with a 2.8 GHz Pentium IV 
and 1 Gbyte of RAM. Cryptographic software support was derived from the 
well-known OpenSSL library [14]. We used the following three ciphers: DES 
[15], Blowfish [16], and AES [12], of which the first two operate on 8-byte data 
blocks, while AES uses 16-byte blocks. Measurements for encrypting 10 Mbytes, 
including the initialization cost associated with each invocation of the encryption 
algorithms, are shown in Table 1. 

As pointed out earlier, a constant start-up cost is associated with each algo- 
rithm. This cost becomes significant when invoking the cipher multiple times. 
Blowfish is the fastest of the three in terms of sheer encryption speed; however, 
it also incurs the highest start-up cost. This is clearly illustrated in the 
measurements of the encryption of the small data units. All algorithms display reduced 
encryption costs when 16-Kbyte blocks are used. 
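
One way to read Table 1 is through a simple linear model, total time = 
(number of operations) x (start-up cost) + (bulk cost). The model is our own 
simplifying assumption, not something the experiment establishes; it also 
ignores the small differences in total bytes across the three columns. The 
sketch below fits the two unknowns from the 100-byte and 16-Kbyte columns and 
checks the fit against the 120-byte column. 

```python
# Timings from Table 1 (msec): {cipher: {number_of_operations: total_time}}
TABLE1 = {
    "AES":      {100_000: 365, 83_333: 334, 625: 194},
    "DES":      {100_000: 372, 83_333: 354, 625: 229},
    "Blowfish": {100_000: 5280, 83_333: 4409, 625: 170},
}

def fit(times: dict) -> tuple:
    """Solve total = ops * startup + bulk using the 100,000-op and 625-op columns."""
    startup = (times[100_000] - times[625]) / (100_000 - 625)
    bulk = times[625] - 625 * startup
    return startup, bulk

for name, times in TABLE1.items():
    startup, bulk = fit(times)
    predicted = 83_333 * startup + bulk
    print(f"{name}: start-up ~{startup * 1000:.1f} usec/op, "
          f"predicted {predicted:.0f} msec vs measured {times[83_333]} msec")
```

Under this model the fitted per-operation start-up cost is under 2 microseconds 
for AES and DES and roughly 50 microseconds for Blowfish, and the predictions 
for the 120-byte column fall within a few percent of the measured values. 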

The main conclusion we draw from these results is that encrypting the same 
amount of data using a few large blocks is clearly more efficient than using many 
smaller blocks. The cost difference is due mainly to the start-up cost associated 
with the initialization of the encryption algorithms. It is thus clearly advanta- 
geous to minimize the total number of encryption operations, while ensuring 
that input data matches up with the encryption algorithm’s block size (in order 
to minimize padding). One obvious way is to cluster sensitive data which needs 
to be encrypted. This is, in fact, a feature of the new storage model described 
in section 4.2. 



4 Partition Plaintext and Ciphertext (PPC) Model 

The majority of today’s database systems use the N-ary Storage Model (NSM) 
[17] which we now describe. 

4.1 N-ary Storage Model (NSM) 

NSM stores records from a database contiguously, starting at the beginning of 
each page. An offset table is used at the end of the page to locate the beginning of 
each record. NSM is optimized for transferring data to and from secondary 
storage and offers excellent performance when the query workload is highly selective 
and involves most record attributes. It is also popular since it is well-suited for 
online transaction processing; more so than another prominent storage model, 
the Decomposed Storage Model (DSM) [1]. 

Fig. 1. NSM structure for our sample relation. 
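
The NSM layout described above (records packed from the start of the page, an 
offset table at the end) can be sketched as follows. The class name, the 
256-byte page, the 2-byte little-endian offsets, and the sample records are 
illustrative assumptions of ours; a real DBMS page also carries a header, 
which is omitted here. 

```python
import struct

PAGE_SIZE = 256  # small illustrative page, not the 16-Kbyte pages discussed in the text

class NSMPage:
    """Records grow forward from the start; 2-byte offsets grow backward from the end."""

    def __init__(self, size: int = PAGE_SIZE):
        self.buf = bytearray(size)
        self.free = 0     # next free byte for record data
        self.count = 0    # number of records stored

    def insert(self, record: bytes) -> int:
        slot_pos = len(self.buf) - 2 * (self.count + 1)
        if self.free + len(record) > slot_pos:
            raise ValueError("page full")
        self.buf[self.free:self.free + len(record)] = record
        struct.pack_into("<H", self.buf, slot_pos, self.free)  # record start offset
        self.free += len(record)
        self.count += 1
        return self.count - 1

    def _start(self, slot: int) -> int:
        return struct.unpack_from("<H", self.buf, len(self.buf) - 2 * (slot + 1))[0]

    def read(self, slot: int) -> bytes:
        start = self._start(slot)
        end = self._start(slot + 1) if slot + 1 < self.count else self.free
        return bytes(self.buf[start:end])

page = NSMPage()
page.insert(b"1001|Alice|Sales|52000")
page.insert(b"1002|Bob|R&D|61000")
```

Because each record interleaves sensitive and non-sensitive attributes, the 
sensitive bytes end up scattered across the page, which is the property that 
makes encryption awkward in this layout. 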

NSM and Encryption: Even though NSM has been a very successful RDBMS 
storage model, it is rather ill-suited for incorporating encryption. This is espe- 
cially the case when a record has both sensitive and non-sensitive attributes. 
We will demonstrate, via an example scenario, exactly how the computation 
and storage overheads are severely increased when encryption is used within 
the NSM model. We assume a sample relation that has four attributes: 
EmpNo, Name, Department, and Salary. Of these, only Name and Salary are 
sensitive and must be encrypted. Figure 1 shows the NSM record structure. 

Since only two attributes are sensitive, we would encrypt at the attribute 
level so as to avoid unnecessary encryption of non-sensitive data (see section 
3.2). Consequently, we need one encryption operation for each attribute value 
(see footnote 1). 

As described in section 3.1, using a symmetric-key algorithm in block cipher 
mode requires padding the input to match the block size. This can result in signif- 
icant overhead when encrypting multiple values, each needing a certain amount 
of padding. For example, since AES [12] uses 16-byte input blocks, encryption 
of a 2-byte attribute value would require 14 bytes of padding. 

To reduce the costs outlined above, we must avoid small non-contiguous 
sensitive plaintext values. Instead, we need to cluster them in some way, thereby 
reducing the number of encryption operations. Another potential benefit would 
be a reduced amount of padding: per cluster, as opposed to per attribute value. 
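
The padding argument can be checked with a little arithmetic. The sketch below 
assumes minimal padding up to the next 16-byte boundary (schemes such as 
PKCS#7 add a full extra block when the input is already aligned, which would 
only widen the gap), and the attribute sizes are hypothetical. 

```python
BLOCK = 16  # AES block size in bytes

def padded_len(n: int, block: int = BLOCK) -> int:
    """Length after padding up to the next block boundary (none if aligned)."""
    return -(-n // block) * block

def padding(n: int, block: int = BLOCK) -> int:
    """Number of padding bytes added to an n-byte input."""
    return padded_len(n, block) - n

# Hypothetical sizes (bytes) of the sensitive attributes of three records.
names = [5, 3, 7]
salaries = [2, 2, 2]

per_value = sum(padding(n) for n in names + salaries)   # pad every value separately
per_cluster = padding(sum(names) + sum(salaries))       # pad the clustered values once
```

With these sizes, padding each value separately adds 75 bytes, whereas padding 
the 21 clustered bytes once adds 11; the 2-byte case reproduces the 14 bytes 
of padding mentioned above. 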

Optimized NSM: Since using encryption in NSM is quite costly, we suggest 
an obvious optimization. It involves storing all encrypted attribute values of 
one record sequentially (and, similarly, all plaintext values). With this optimiza- 
tion, a record ends up consisting of two parts: the ciphertext attributes followed 
by the plaintext (non-sensitive) attributes. The optimized version of NSM 
reduces padding overhead and eliminates multiple encryption operations within a 
record. However, each record is still stored individually, meaning that, for each 
record, one encryption operation is needed. Moreover, each record is padded 
individually. 

1 If we instead encrypted at record or page level, non-sensitive attributes EmpNo 
and Department would also be encrypted, thus requiring additional encryption 
operations. Even worse, for selection queries that only involve non-sensitive 
attributes, the cost of decrypting the data would still apply. 

4.2 Partition Plaintext Ciphertext Model (PPC) 

Our approach is to cluster encrypted data while retaining NSM’s benefits. We 
note that the Partition Attributes Across (PAX) model was recently proposed as 
an alternative to NSM. It involves partitioning a page into mini-pages to improve 
upon cache performance [18]. Each mini-page represents one attribute in the 
relation. It contains the value for this attribute of each record stored in the page. 
Our model, referred to as Partition Plaintext and Ciphertext (PPC), employs an 
idea similar to that of PAX, in that pages are split into two mini-pages, based 
on plaintext and ciphertext attributes, respectively. Each record is likewise split 
into two sub-records. 

PPC Overview: The primary motivation for PPC is to reduce encryption costs, 
including computation and storage costs, while keeping the NSM storage schema. 
We thus take advantage of NSM while enabling efficient encryption. Implementing 
PPC on existing DBMSs that use NSM requires only a few modifications to the 
page layout. PPC stores the same number of records on each page as does NSM. 

Within a page, PPC vertically partitions a record into two sub-records, one 
containing the plaintext attributes and the other the ciphertext attributes. Both 
sub-records are organized in the same manner as NSM records. PPC stores all 
plaintext sub-records in the first part of the page, which we call a plaintext 
mini-page. The second part of the page stores a ciphertext mini-page. Each mini-page 
has the same structure as a regular NSM page and records within the two mini-pages 
are stored in the same relative order. At the end of each mini-page is an offset 
table pointing to the end of each sub-record. Thus, a PPC page can be viewed 
as two NSM mini-pages. In particular, if a page does not contain any ciphertext, 
the PPC layout is identical to NSM. Current database systems using NSM would 
only need to change the way they access pages in order to incorporate our PPC 
model. 

Figure 2 shows an example of a PPC page. In it, three records are stored 
within a page. The plaintext mini-page contains non-sensitive attribute values 
EmpNo and Department and the ciphertext mini-page stores encrypted Name 
and Salary attributes. The advantage of encryption at the mini-page level can be 
seen by observing that only one encryption operation is needed per page, and, of 
course, only one decryption is required when the page is brought into memory 
and any sensitive attributes are accessed. 

The PPC page header contains two mini-page pointers (in addition to the 
typical page header fields), which hold the starting addresses of the plaintext 
and ciphertext mini-pages. 
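
At a logical level, the vertical split into two mini-pages can be sketched as 
below. This is an illustration under our own assumptions: the class and field 
names are hypothetical, the mini-pages are modeled as Python lists rather than 
byte ranges located through header pointers, and encryption of the ciphertext 
mini-page is left out. 

```python
from dataclasses import dataclass, field

PLAIN = ("EmpNo", "Department")   # attributes kept in the plaintext mini-page
SENSITIVE = ("Name", "Salary")    # attributes kept in the ciphertext mini-page

@dataclass
class PPCPage:
    plaintext_minipage: list = field(default_factory=list)   # plaintext sub-records
    ciphertext_minipage: list = field(default_factory=list)  # sensitive sub-records

    def insert(self, record: dict) -> int:
        """Split a record vertically; both sub-records share one slot number."""
        self.plaintext_minipage.append(tuple(record[a] for a in PLAIN))
        self.ciphertext_minipage.append(tuple(record[a] for a in SENSITIVE))
        return len(self.plaintext_minipage) - 1

    def read(self, slot: int) -> dict:
        """Re-assemble a record from its two sub-records (same relative order)."""
        rec = dict(zip(PLAIN, self.plaintext_minipage[slot]))
        rec.update(zip(SENSITIVE, self.ciphertext_minipage[slot]))
        return rec

page = PPCPage()
row = {"EmpNo": 1001, "Name": "Alice", "Department": "Sales", "Salary": 52000}
slot = page.insert(row)
```

Since all sensitive sub-records sit together in ciphertext_minipage, that list 
stands for the single unit that would be encrypted or decrypted per page. 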

Buffer Manager Support: When a PPC page is brought into a buffer slot, 
depending upon the nature of the access, the page may first need to be decrypted. 
The buffer manager, besides supporting a write bit to indicate whether a page 
has been modified, also needs to support an encryption bit to determine whether 
the ciphertext mini-page has already been decrypted. Initially, when the page is 
brought into a buffer slot, its write bit is off, and its encryption bit is on. Each 
read or update request to the buffer manager indicates whether a sensitive field 
needs to be accessed. 

The buffer manager processes read requests in the obvious manner: if the 
encryption bit is on, it requests the mini-page to be decrypted and then resets 
the encryption bit. If the bit is off, the page is already in memory in plaintext 
form. Record insertion, deletion and update are also obvious: the write bit is set, 
while the encryption bit is only modified if any ciphertext has to be decrypted. 

Whenever a page in the buffer is chosen by the page replacement policy to be 
sent to secondary storage, the buffer manager first checks if the page’s encryption 
bit is off and the write bit is on. If so, the ciphertext mini-page is first re-encrypted 
before being stored. 
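
The interplay of the write bit and the encryption bit can be sketched as a 
small state machine. The class below is a simplified illustration, not the 
prototype's buffer manager: each mini-page is a single byte string, the cipher 
is a toy self-inverse XOR, and all names are our own. 

```python
class BufferSlot:
    """One buffered PPC page with a write bit and an encryption bit."""

    def __init__(self, plain_part: bytes, cipher_part: bytes, cipher):
        self.plain_part = plain_part
        self.cipher_part = cipher_part   # ciphertext mini-page as read from disk
        self.cipher = cipher             # toy cipher: same call encrypts and decrypts
        self.write_bit = False           # page not modified yet
        self.encryption_bit = True       # ciphertext mini-page still encrypted
        self.crypto_ops = 0

    def _ensure_decrypted(self):
        if self.encryption_bit:
            self.cipher_part = self.cipher(self.cipher_part)
            self.encryption_bit = False
            self.crypto_ops += 1

    def read(self, sensitive: bool) -> bytes:
        if not sensitive:
            return self.plain_part       # no decryption needed
        self._ensure_decrypted()
        return self.cipher_part

    def update(self, sensitive: bool, new_value: bytes):
        self.write_bit = True
        if sensitive:
            self._ensure_decrypted()
            self.cipher_part = new_value
        else:
            self.plain_part = new_value

    def evict(self):
        """Return the mini-pages to write back, or None for an unmodified page."""
        if not self.write_bit:
            return None                  # disk copy is still valid
        if not self.encryption_bit:
            self.cipher_part = self.cipher(self.cipher_part)  # re-encrypt once
            self.encryption_bit = True
            self.crypto_ops += 1
        return self.plain_part, self.cipher_part

def toy_cipher(data: bytes) -> bytes:
    return bytes(b ^ 0x5A for b in data)   # stand-in only, not a real cipher

slot = BufferSlot(b"1001|Sales", toy_cipher(b"Alice|52000"), toy_cipher)
```

Repeated sensitive reads trigger only one decryption, and a page whose 
sensitive part was never touched is written back without any cipher call. 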



4.3 Analysis and Comparisons of the PPC Model 

We now compare NSM with the proposed PPC model. As will be seen below, 
PPC outperforms NSM irrespective of the encryption granularity used in 
NSM. Two main advantages of PPC are: 1) considerably fewer encryption 
operations due to clustering of sensitive data, and 2) no encryption overhead 
for queries involving only non-sensitive attributes. 

NSM with attribute-level encryption: The comparison is quite straightforward. 
NSM with attribute-level encryption requires as many encryption operations 
per record as there are sensitive attributes. Records are typically small 
enough that a large number can fit into one page, resulting in a large number 
of encryption operations per page. As already stated, only one decryption 
per page is needed in the PPC model. 

NSM with record-level encryption: One encryption is required for each 
record in the page. PPC requires only one encryption per page. 

NSM with page-level encryption: One encryption per page is required in 
both models. The only difference is for queries involving both sensitive and 
non-sensitive attributes. NSM performs an encryption operation regardless of the 
types of attributes, whereas PPC only encrypts as necessary. 




158 B. Iyer et al. 



Optimized NSM: We mentioned the optimized NSM model in section 4.1. It is 
similar to NSM with record-level encryption (each record requires one 
encryption). It only differs from NSM in that extra overhead is incurred for non-sensitive 
queries. Again, PPC requires only one encryption operation per page as opposed 
to one per record for the optimized NSM. 

Note that, for each comparison, the mode of access is irrelevant, i.e., whether 
records within the page are accessed sequentially or randomly, as is the case 
with DSS and OLTP workloads, respectively. For every implementation, other than 
NSM with page-level encryption, one still needs to access and perform an encryption 
operation on the record within the page, regardless of how the record is accessed. 

From the above discussion, we establish that PPC has the same costs as 
the regular non-encrypted NSM when handling non-sensitive queries. On the 
other hand, when sensitive attributes are involved, PPC costs a single encryption 
operation. We thus conclude that PPC is superior to all NSM variants in terms 
of encryption-related computation overhead. 
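
The comparison can be summarized by counting cipher invocations for a scan 
that touches the sensitive attributes of every record on one page. The 
function below simply encodes the counts stated in this section; the page 
occupancy used in the example (160 records, two sensitive attributes) is a 
hypothetical figure, not a measurement. 

```python
def ops_per_page(records: int, sensitive_attrs: int) -> dict:
    """Cipher invocations per page when every record's sensitive data is read."""
    return {
        "NSM, attribute-level": records * sensitive_attrs,
        "NSM, record-level": records,
        "Optimized NSM": records,
        "NSM, page-level": 1,
        "PPC": 1,
    }

counts = ops_per_page(records=160, sensitive_attrs=2)
for scheme, ops in counts.items():
    print(f"{scheme:22s} {ops:4d}")
```

For queries that touch only non-sensitive attributes, the argument above is 
that PPC needs no cipher invocation at all, while NSM with page-level 
encryption still pays its one operation per page. 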

Although we have not yet performed a detailed comparison of storage
overheads (due to padding), we claim that PPC requires the same amount
of space as, or less than, any NSM variant. In the most extreme case, encrypting at the
attribute level requires each sensitive attribute to be padded. A block cipher
operating on 128-bit blocks (e.g., AES) would, on average, add 64 bits of
padding to each encrypted unit. Only encrypting at page level would minimize
padding overhead, since only a single unit is encrypted. This is the case for
both NSM with page-level encryption and PPC.
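
As a rough illustration of the padding argument, the sketch below counts padding bytes for one page under three encryption granularities. It assumes PKCS#7-style padding (which always adds between 1 and 16 bytes per encrypted unit) and a hypothetical page of 100 records with three sensitive attributes of 4, 10 and 25 bytes; neither assumption comes from the paper.

```python
BLOCK_BYTES = 16  # 128-bit block, as used by AES


def padding_bytes(unit_len, block=BLOCK_BYTES):
    """Padding added to one encrypted unit (assumed PKCS#7-style)."""
    return block - (unit_len % block)


def total_padding(unit_lengths, block=BLOCK_BYTES):
    """Sum of padding over all separately encrypted units."""
    return sum(padding_bytes(n, block) for n in unit_lengths)


# Hypothetical page content: sizes of the sensitive attributes of each record.
records = [(4, 10, 25)] * 100

# Every attribute is its own encrypted unit.
attribute_level = total_padding(n for rec in records for n in rec)
# The sensitive part of every record is one unit.
record_level = total_padding(sum(rec) for rec in records)
# All sensitive data of the page is a single unit (NSM-page and PPC).
page_level = total_padding([sum(sum(rec) for rec in records)])

if __name__ == "__main__":
    print(attribute_level, record_level, page_level)
```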

4.4 Database Operations in PPC 

Basic database operations (insertion, deletion, update and scan) in PPC are
implemented similarly to their counterparts in NSM, since each record in PPC is
stored as two sub-records conforming to the NSM structure. During insertion,
a record is split into two sub-records and each is inserted into its corresponding
mini-page (sensitive or non-sensitive). As described in Section 4.2, the buffer
manager determines when the ciphertext mini-page needs to be en/de-crypted.
Implementation of deletion and update operations is straightforward.
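
A minimal sketch of the insertion path described above is given here. The class and function names are hypothetical, records are modelled as plain dictionaries, and the ciphertext mini-page is shown in its decrypted, in-memory form; actual encryption on write-back is left out.

```python
from dataclasses import dataclass, field


@dataclass
class PPCPage:
    plaintext_minipage: list = field(default_factory=list)
    # Held decrypted while the page is in the buffer pool.
    ciphertext_minipage: list = field(default_factory=list)


def split_record(record, sensitive):
    """Split one record into a plaintext and a sensitive sub-record."""
    plain = {k: v for k, v in record.items() if k not in sensitive}
    secret = {k: v for k, v in record.items() if k in sensitive}
    return plain, secret


def insert(page, record, sensitive):
    """Insert both sub-records; they share the same slot number."""
    plain, secret = split_record(record, sensitive)
    page.plaintext_minipage.append(plain)
    page.ciphertext_minipage.append(secret)
    return len(page.plaintext_minipage) - 1


def read(page, slot, needs_sensitive):
    """Reassemble a record, touching the sensitive part only if required."""
    rec = dict(page.plaintext_minipage[slot])
    if needs_sensitive:
        rec.update(page.ciphertext_minipage[slot])
    return rec
```

For example, inserting `{"id": 1, "name": "a", "salary": 10}` with `sensitive={"salary"}` stores `salary` only in the ciphertext mini-page, and `read(page, 0, False)` returns the record without it.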

When running a query, two scan operators are invoked, one for each mini- 
page. Each scan operator sequentially reads a sub-record in the corresponding 
mini-page. If the predicate associated with the scan operator refers to an en- 
crypted attribute, the scan operator indicates in its request to the buffer man- 
ager that it will be accessing an encrypted attribute. Scans could be implemented 
either using sequential access to the file containing the table, or using an index. 
Indexing on encrypted attributes is discussed in section 4.6 below. 
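
The interaction between a scan and the buffer manager can be sketched as follows. This is a simplified, hypothetical model (the `BufferManager` class and `scan` function are not taken from the prototype): the scan announces whether it needs encrypted attributes, and the buffer manager decrypts a ciphertext mini-page at most once while it stays in memory.

```python
class BufferManager:
    def __init__(self):
        self.decryptions = 0
        self.decrypted_pages = set()

    def fetch(self, page_id, needs_sensitive):
        """Decrypt the ciphertext mini-page only on first sensitive access."""
        if needs_sensitive and page_id not in self.decrypted_pages:
            self.decryptions += 1
            self.decrypted_pages.add(page_id)


def scan(pages, predicate_attrs, sensitive_attrs, bufmgr):
    """Sequentially read every record; return the number of records read."""
    needs_sensitive = bool(set(predicate_attrs) & set(sensitive_attrs))
    rows = 0
    for page_id, records in pages.items():
        for _ in records:
            # One request per record access; decryption happens once per page.
            bufmgr.fetch(page_id, needs_sensitive)
            rows += 1
    return rows
```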

4.5 Schema Change 

PPC stores the schema of each relation in the catalog file. Upon adding or 
deleting an attribute, PPC creates a new schema and assigns it a unique version 
ID. All schema versions are stored in the catalog file. The header at the beginning
of each plaintext and ciphertext sub-record contains the schema version that it
conforms to. The advantage of having schemas in both plaintext and ciphertext 
sub-records is that, when retrieving a sub-record, only one lookup is needed to 
determine the schema that a record conforms to. 

Adding or deleting an attribute is handled similarly to NSM, with two
exceptions: (1) a previously non-sensitive attribute is changed to sensitive (i.e., it
needs to be encrypted), and (2) a sensitive attribute is changed to non-sensitive
(i.e., it needs to be stored as plaintext).

To handle the former, a new schema is first created and assigned a new 
version ID. Then, all records containing this attribute are updated according to 
the new schema. This operation can be executed in a lazy fashion (pages are 
read and translated asynchronously), or synchronously, in which case the entire 
table is locked and other transactions prevented from accessing the table until 
the reorganization is completed. 

Changing an attribute’s status from sensitive to non-sensitive can be deferred 
(for bulk decryption) or done in a lazy fashion, since it is generally not urgent 
to physically perform all decryption at once. For each accessed page, the schema
comparison operation will indicate whether a change is necessary. At that time, 
the attribute will be decrypted and moved from the ciphertext to the plaintext 
mini-page. 
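
The lazy variant of this schema change can be sketched as a check performed whenever a record is accessed. The catalog layout, attribute names and function name below are hypothetical, and the ciphertext sub-record is assumed to be already decrypted in the buffer pool.

```python
# Hypothetical catalog: schema version -> placement of each attribute.
CATALOG = {
    1: {"plain": {"id", "name"}, "sensitive": {"salary"}},
    2: {"plain": {"id", "name", "salary"}, "sensitive": set()},  # salary declassified
}
CURRENT_VERSION = 2


def migrate_on_access(record):
    """Bring one record up to the current schema version if it is stale.

    record: {"version": int, "plain": dict, "secret": dict}
    """
    if record["version"] == CURRENT_VERSION:
        return record  # schema comparison says no change is needed
    target = CATALOG[CURRENT_VERSION]
    merged = {**record["plain"], **record["secret"]}
    return {
        "version": CURRENT_VERSION,
        "plain": {k: v for k, v in merged.items() if k in target["plain"]},
        "secret": {k: v for k, v in merged.items() if k in target["sensitive"]},
    }
```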



4.6 Encrypted Index 

Index data structures are crucial components of any database system, mainly to 
provide efficient range and selection queries. We need to assess potential impact 
of encryption on the performance of the index. Note that an index built upon a 
sensitive attribute must be encrypted, since it contains attribute values present 
in the actual relation. 

There are basically two approaches to building an index based on sensitive 
attribute values. In the first, the index is based upon ciphertext, while, in the 
second, the intermediate index is based upon plaintext and the final index is 
obtained by encrypting the intermediate index. Based upon characteristics which 
make encryption efficient in PPC, we choose to encrypt at page level, thereby 
encrypting each node independently. Whenever a specific index is needed in the 
processing of a query, necessary parts of the data structure are brought into 
memory and decrypted. This approach provides full index functionality while 
keeping the index itself secure. 

There are certain tradeoffs associated with either of the two approaches. Since
encryption does not preserve order, the first approach is infeasible when the index is
used to process range queries. Note that exact-match selection queries are still
possible if encryption is deterministic.² When searching for an attribute value,
the value is simply encrypted under the same encryption key as used in the index
before the search. The second approach, in contrast, can support range queries
efficiently, albeit with the additional overhead of decrypting index pages.

² Informally, if encryption is non-deterministic (randomized), the same cleartext
encrypted twice yields two different ciphertexts. This is common practice for preventing
ciphertext correlation and dictionary attacks.

Given the above tradeoff, the first strategy appears preferable for hash-indices 
or B-tree indices over record identifiers and foreign keys where data access is ex- 
pected to be over equality constraints (and not a range). The second strategy 
is better when creating B-trees where access may include range queries. Since 
the second strategy incurs additional encryption overhead, we discuss its perfor- 
mance in further detail. 
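
To illustrate why the first strategy supports equality lookups, the sketch below builds a hash index over deterministic tokens. Note the stand-in: HMAC-SHA256 from the Python standard library is used here only as a deterministic keyed function; it is not an encryption scheme and not what the prototype uses. The key, the sample values and the function names are hypothetical.

```python
import hashlib
import hmac

KEY = b"demo-key"  # hypothetical key, for illustration only


def token(value):
    """Deterministic keyed token: equal values always map to equal tokens."""
    return hmac.new(KEY, str(value).encode(), hashlib.sha256).digest()


# Hash index from token to record identifiers.
index = {}
for rid, salary in enumerate([52000, 48000, 61000, 48000]):
    index.setdefault(token(salary), []).append(rid)


def lookup_equal(value):
    """Exact-match search: tokenize the search value, then probe the index."""
    return index.get(token(value), [])
```

Because the tokens carry no ordering information about the underlying values, a range predicate cannot be answered from this structure, which is the limitation noted in the text.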

When utilizing a B-tree index, we can assume that plaintext representation of 
the root node already resides in memory. With a tree of depth two, we only need 
to perform two I/O operations for a selection: one each for the node at level 1 
and 2, and, correspondingly, two decryption operations. As described in section 
3.6, we measure encryption overhead in the number of encryption operations. 
Since one decryption is needed for each accessed node, the total overhead is 
the number of accessed nodes multiplied by the start-up cost of the underlying 
encryption algorithm. 
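
The overhead estimate in the preceding paragraph amounts to a single multiplication, sketched below. The function names are hypothetical and the start-up cost is a free parameter in whatever time unit is convenient.

```python
def nodes_below_cached_root(depth):
    """Nodes read for one selection when the root is already in memory."""
    return depth


def index_decryption_overhead(nodes_accessed, startup_cost):
    """One decryption per accessed node, each paying the cipher start-up cost."""
    return nodes_accessed * startup_cost


if __name__ == "__main__":
    # Tree of depth two: nodes at levels 1 and 2 are read and decrypted.
    print(index_decryption_overhead(nodes_below_cached_root(2), startup_cost=0.5))
```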



5 Experiments 

We created a prototype implementation of the PPC model based on MySQL
version 4.1.0-alpha. This version provides all the necessary DBMS components.
We modified the InnoDB storage model by altering its page and record
structure to create plaintext and ciphertext mini-pages. We utilized the OpenSSL
cryptographic library as the basis for our encryption-related code. For the
actual measurements we used the Blowfish encryption algorithm. The experiments
were conducted on a 1.8 GHz PIV machine with 384MB of RAM running
Windows XP.



5.1 PPC Details 

Each page in InnoDB is, by default, 16KB. Depending on the number of en- 
crypted attributes to be stored, we split the existing pages into two parts to 
accommodate the respective plaintext and ciphertext mini-pages. Partitioning 
of one record into plaintext and ciphertext sub-records only takes place when the 
record is written to its designated page. The InnoDB record manager was modified
to partition records and store them in their corresponding mini-pages. 

If a record must be read from disk during a query execution, it is first ac- 
cessed by InnoDB which converts it to MySQL record format. We modified this 
conversion in the record manager to determine whether the ciphertext part of the 
record is needed in the current query. If so, the respective ciphertext mini-page 
is first decrypted before the ciphertext and plaintext sub-records are combined 
into a regular MySQL record. The ciphertext mini-page is re-encrypted if and 
when it is written back to disk.

5.2 Overview of Experiments

For each experiment, we compared running times between NSM with no
encryption (NSM), NSM with page-level encryption (NSM-page), and PPC (PPC). Our
goal was to verify that PPC would outperform NSM-page and, when no sensitive
values are involved in a query, would perform similarly to NSM. Lastly, when all
attributes involved are encrypted, we expect PPC and NSM-page to perform
roughly equally.

As claimed earlier, PPC handles mixed queries³ very well. To support this,
we created a schema where at least one attribute from each table - and
more than one in larger tables (2 attributes in lineitem and 3 in orders) -
was encrypted. Specifically, we encrypted the following attributes: l_partkey,
l_shipmode, p_name, s_acctbal, ps_supplycost, c_name, o_orderdate, o_custkey,
o_totalprice, n_name, and r_name.

We ran three experiments, all based on the TPC-H data set. First, we 
measured the loading time during bulk insertion of a 100-, 200-, and 500-MB 
database. Our second experiment consisted of running TPC-H query number 1, 
which is based only on the lineitem table, while varying the number of encrypted 
attributes involved in the query, in order to analyze PPC performance. Finally, 
we compared the query response times of a subset of TPC-H queries, attempting to
identify and highlight the properties of the PPC scheme. 



5.3 Bulk Insertion 

We compared bulk insertion loading time required for a 100-, 200-, and 500-MB 
TPC-H database for each of the three models. The schema described in section 
5.2 was used to define the attributes to be encrypted in PPC. On average,
NSM-page and PPC incur a 24% and 15% overhead, respectively, in loading time
as compared to plain NSM. As expected, PPC outperforms NSM-page since less
data is encrypted (not all attributes are considered sensitive).

We note that a more accurate PPC implementation would perform some page 
reorganization if the relations contained variable length attributes, in order to 
make best use of space in the mini-pages. As stated in [18], the PAX model 
suffers a 2-10% performance penalty due to page reorganization, depending on 
the desired degree of space utilization in each mini-page. However, the PAX 
model creates as many mini-pages as there are attributes, while we always create
only two. 

5.4 Varying Number of Encrypted Attributes 

In this experiment, we ran queries within a single table to analyze PPC
performance while increasing the number of encrypted attributes. Query 1, which only
involves table lineitem, was chosen for this purpose, and executed over a 200MB
database. The lineitem table contains 16 attributes, 7 of which are involved in the
query. We then computed the running time for NSM and NSM-page, both of which
remained constant as the number of encrypted attributes varied. Eight different instances
of PPC were run: the first having no encrypted attributes and the last having
14 (we did not encrypt the attributes used as primary keys). Figure 3 illustrates
our results.

³ Queries involving both sensitive and non-sensitive attributes.

Fig. 3. Varying the number of encrypted attributes in TPC-H Query 1. PPC-(x,y)
indicates that y attributes in lineitem were encrypted with x of them appearing in the
query.

The results clearly show that the overhead incurred when encrypting
additional attributes in PPC is rather minimal. As expected, PPC with no (or only a
few) encrypted attributes exhibits performance almost identical to that of NSM.
Also, as more attributes are encrypted, PPC performance begins to resemble
that of NSM-page. The last two instances of PPC have relatively longer query
execution times. This is due to encryption of the lineitem attributes l_shipinstruct
and l_comment, which are considerably larger than any other in the table.

5.5 TPC-H Queries 

Recall that PPC and NSM-page each require only one encryption operation 
per page. However, NSM-page executes this operation whether or not there are 
encrypted attributes in the query. In the following experiment, we attempted to 
exploit the advantages of PPC, by comparing its performance with NSM-page 
when executing a chosen subset of TPC-H queries on a 200MB database. As in 
the bulk insertion experiment, we utilized a pre-defined schema (see section 5.2) 
to determine the attributes to encrypt when using PPC. 

In Figure 4, we refer to individual queries to highlight some of the interesting
observations from the experiment. Queries 1 and 6 are range queries and,
according to our PPC schema, have no sensitive attributes involved. The running times
of NSM and PPC are, as expected, almost identical. However, since the tables
involved contain encrypted attributes, NSM-page suffers from over-encryption and
consequently has to decrypt records involved in the query, thereby adding to its
running time. Due to the simplicity of these queries, the NSM-page encryption
overhead is relatively small.

Fig. 4. Comparison of running times from selected TPC-H queries over a 200 MB
database for NSM, PPC, and NSM-page.



Query 8 involved encrypted attributes from each of the three largest tables
(lineitem, partsupp, orders), causing both PPC and NSM-page to incur
relatively significant encryption-caused overhead. Again, PPC outperforms
NSM-page since it has to decrypt less data. In contrast, query 10 involves encrypted
attributes from four tables, only one of which is large (table orders). In this case,
PPC performs well as compared to NSM-page, as the latter needs to decrypt
lineitem in addition to other tables involved.

Overall, based on the 10 queries shown in Figure 4, PPC and NSM-page incur
6% and 33% overhead, respectively, in query response time. We feel that this
experiment illustrates PPC’s superior performance for queries involving both
sensitive and non-sensitive attributes.

6 Conclusion 

In this paper, we proposed a new DBMS storage model (PPC) that facilitates 
efficient incorporation of encryption. Our approach is based on grouping sensitive 
data in order to minimize the number of encryption operations, thus greatly
reducing encryption overhead. We compared and contrasted PPC with NSM 
and discussed a number of important issues regarding storage, access and query 
processing. Our experiments clearly illustrate advantages of the proposed PPC 
model.



Abstract. In this paper, we study the safety guarantees of group communication-based
database replication techniques. We show that there is a model mismatch
between group communication and database, and because of this, classical group
communication systems cannot be used to build 2-safe database replication. We
propose a new group communication primitive called end-to-end atomic broadcast
that solves the problem, i.e., can be used to implement 2-safe database replication.
We also introduce a new safety criterion, called group-safety, that has advantages
both over 1-safety and 2-safety. Experimental results show the gain of efficiency
of group-safety over lazy replication, which ensures only 1-safety.



1 Introduction 

Database systems represent an important aspect of any IT infrastructure and as such 
require high availability. Software-based database replication is an interesting option 
because it promises increased availability at low cost. Traditional database replication is 
usually presented as a trade-off between performance and consistency [1], i.e., between
eager and lazy replication. Eager replication, based on an atomic commitment protocol,
is slow and deadlock prone. Lazy replication, which forgoes the atomic commitment
protocol, can introduce inconsistencies, even in the absence of failures.

However, eager replication does not need to be based on atomic commitment. A
different approach, which relies on group communication primitives to abstract the
network functionality, has been proposed in [2,3]. These techniques typically use an atomic
broadcast primitive (also called total order broadcast) to deliver and order transactions
in the same serial order on all replicas, and offer an answer to many problems of eager
replication without the drawbacks of lazy replication: they offer good performance [4],
use the network more efficiently [5] and also reduce the number of deadlocks [6].

Conceptually, group communication-based data replication systems are built by
combining two modules: (1) a database module, which handles transactions and (2) a group
communication module, which handles communication. When combined, these two
modules result in a replicated database. However, the two modules assume different
failure models, which means that the failure semantics of the resulting system are unclear.

In this paper, we examine the fault tolerance guarantees offered by database
replication techniques based on group communication. The model mismatch between group
communication and database systems comes from the fact that they originate from two
different communities. We explore this mismatch from two points of view: from the
database point of view, and from the distributed system point of view. Database
replication is usually specified with the 1-safety and 2-safety criteria. The first offers good
performance, the second strong safety. However, group communication, as currently
specified, cannot be used to implement 2-safe database replication. The paper shows
how this can be corrected. Moreover, we show that the 1-safety and 2-safety criteria
can advantageously be replaced by a new safety criterion, which we call group-safety.
Group-safety ensures that the databases stay consistent as long as the number of
server crashes is bounded. While this notion is natural for group communication, it is
not for replicated databases. Simulation results show that group-safe database replication
leads to improved performance over 1-safety, while at the same time offering stronger
guarantees.

The rest of the paper is structured as follows. Section 2 presents the model for the
database system and for group communication, and explains the use of group
communication (more specifically atomic broadcast) for database replication. Section 3 shows
that this solution, based on the current specification of atomic broadcast, cannot be 2-safe.
Section 4 proposes a new specification for atomic broadcast, in order to achieve
2-safety. Section 5 defines the new safety criterion called group-safety. Section 6 compares
the efficiency of group-safe replication and 1-safe replication by simulation. Section 7
discusses the relationship between group-safe replication and lazy replication. Finally,
Sect. 8 presents related work and Sect. 9 concludes the paper.



2 Model and Definitions 

We assume that the overall system is built from three components (Fig. 1): the database 
component, the group communication component and the replicated database compo- 
nent. The first two components offer the infrastructure needed to build the application - 
in our case a replicated database. These two infrastructure components are accessed by 
the application, but they have no direct interaction with each other. 

The replicated database component implements the actual replicated database and is
described in Sect. 2.1. The database component contains all the facilities to store the data
and execute transactions locally, and is described in Sect. 2.2. The group communication
component offers broadcast primitives, in particular atomic broadcast, and is described
in Sect. 2.3.

2.1 Database Replication Component

The database replication component is modelled as follows. We assume a set of servers
S_1, ..., S_n, and a fully replicated database D = {D_1, ..., D_n}, where each server S_i
holds a copy D_i of the database. Since group communication does not make much
sense with a replication degree of 2, we consider that n ≥ 3. We assume
update-everywhere replication [1]: clients can submit transactions to any server S_i. Clients
wanting to execute transaction t send it to one server S_d that will act as the delegate
for this transaction: S_d is responsible for executing the transaction and sending back
the results to the client.¹ The correctness criterion for the replicated database is
one-copy serialisability: the system appears to the outside world as one single non-replicated
database.



Replication Scheme. A detailed discussion of the different database replication
techniques appears in [7]. Among these techniques, we consider those that use group
communication, e.g., atomic broadcast (see Sect. 2.3). As a representative, we consider the
technique called update-everywhere, non-voting, single network interaction. Fig. 2
illustrates this technique.² The technique is called non-voting because there is no voting
phase in the protocol to ensure that all servers commit or abort the transaction: this
property is ensured by the atomic broadcast group communication primitive.

The processing of transaction t is done in the following way. The client C sends
the transaction to the delegate server S_d. The delegate processes the transaction, and,
if it contains some write operations, broadcasts the transaction to all servers using an
atomic broadcast. All servers apply the writes according to the delivery order of the atomic
broadcast. Conflicts are detected deterministically and so, if a transaction needs to be
aborted, it is aborted on all servers. Techniques that fit in this category are described
in [8,9,10,4,11].
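
The processing steps above can be sketched as follows. This is an illustrative toy model, not the protocol of any of the cited systems: atomic broadcast is replaced by handing the same ordered list to every replica, and the deterministic conflict check is assumed to be a simple version comparison. All names are hypothetical.

```python
class Replica:
    def __init__(self):
        self.data = {}       # key -> value
        self.version = {}    # key -> number of committed writes
        self.committed = []  # ids of committed transactions

    def apply(self, txn):
        """Deterministic check: abort if a read item changed since it was read."""
        for key, seen in txn["read_versions"].items():
            if self.version.get(key, 0) != seen:
                return False
        for key, value in txn["writes"].items():
            self.data[key] = value
            self.version[key] = self.version.get(key, 0) + 1
        self.committed.append(txn["id"])
        return True


def atomic_broadcast(replicas, ordered_txns):
    """Stand-in for atomic broadcast: every replica sees the same sequence."""
    return [[r.apply(t) for t in ordered_txns] for r in replicas]


if __name__ == "__main__":
    replicas = [Replica() for _ in range(3)]
    t1 = {"id": "t1", "read_versions": {"x": 0}, "writes": {"x": 1}}
    t2 = {"id": "t2", "read_versions": {"x": 0}, "writes": {"x": 2}}
    print(atomic_broadcast(replicas, [t1, t2]))
```

Since every replica evaluates the same check on the same sequence, the conflicting transaction is aborted everywhere without a voting phase.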



¹ The role of the delegate is conceptually the same as that of the primary server; simply, any server
can act as a “primary”.

² However, the results in this paper apply as well to the other techniques in [7] based on group
communication.



Safety Criteria for Replicated Databases. There are three safety criteria for replicated
databases, called 1-safe, 2-safe and very safe [12]. When a client receives a message
indicating that his transaction committed, it means different things depending on the
safety criterion.

1-safe: If the technique is 1-safe, when the client receives the notification of t’s commit,
then t has been logged and will eventually commit on the delegate server of t.

2-safe: If the technique is 2-safe, when the client receives the notification of t’s commit,
then t is guaranteed to have been logged on all available servers, and thus will
eventually commit on all available servers.

Very safe: If the technique is very safe, when the client receives the notification of
t’s commit, then t is guaranteed to have been logged on all servers, and thus will
eventually commit on all servers.

Each safety criterion shows a different tradeoff between safety and availability: the
safer a system is, the less available it is. 1-safe replication ensures that transactions can
be accepted and committed even if only one server is available: synchronisation between 
copies is done outside of the scope of the transaction’s execution. So a transaction can 
commit on the delegate server even if all other servers are unavailable. On the other 
hand, 1-safe replication schemes can lose transactions in case of a crash. A very safe 
system ensures that a transaction is committed on all servers, but this means that a single 
crash renders the system unavailable. This last criterion is not very practical and most 
systems are therefore 1-safe or 2-safe. 

The distinction between 1-safe and 2-safe replication is important. If the technique
is 1-safe, transactions might get lost if one server crashes and another takes over, i.e., 
the durability part of the ACID properties is not ensured. If the technique is 2-safe, no 
transaction can get lost, even if all servers crash. 



2.2 Database Component 

We assume a database component on each node of the system. Each database component 
hosts a full copy of the database. The database component executes local transactions 
and enforces the ACID properties (in particular serialisability) locally. 

We also assume that the local database component offers all the facilities and guar- 
antees needed by the database replication technique (see [7]), and has a mechanism to 
detect and handle transactions that are submitted multiple times, e.g., testable transac- 
tions [13]. 



2.3 Group Communication Component 

Each server S_i hosts one process p_i, which implements the group communication
component. While the database model is quite well established and agreed upon, there is a
large variety of group communication models [14]. Considering the context of the pa- 
per, we mention two of them. The first model is the dynamic crash no-recovery model, 
which is assumed by most group communication implementations. The other model is 
the static crash-recovery model, which has been described in the literature, but has seen 
little use in actual group communication infrastructure.

Dynamic crash no-recovery model. The dynamic crash no-recovery model was
introduced in the Isis system [15], and is also sometimes called the view-based model. In
this model, the group is dynamic: processes can join and leave after the beginning of the
computation. This is handled by a list, which contains the processes that are members of
the group. The list is called the view of the group. The history of the group is represented
as a sequence of views v_0, ..., v_m, a new view being installed each time a process leaves
or joins the group.

In this model, processes that crash do not recover under their old identity. Crashed
processes may still restart, but a process that recovers after a crash has to take a
new identity before being able to rejoin the group. When a crashed process recovers in
a new incarnation, it requests a view change to join the group again. During this view
change, a state transfer occurs: the group communication system requests that one of
the current members of the view makes a checkpoint, and this checkpoint is transferred
to the joining process. Most current group-communication toolkits [15,16,17,18,19,20]
are based on this model or models that are similar.

Dynamic crash no-recovery group communication systems cannot tolerate the crash
of all the members of a view. Depending on synchrony assumptions, if a view contains
n processes, then at best n - 1 crashes can be tolerated.

Static crash recovery model. In the static crash recovery model, the group is static, 
i.e., no process can join the group after system initialisation. In this model, processes 
have access to stable storage, which allows them to save (part of) their state. So, crashed 
processes can recover, keep the same identity, and continue their computation. Most 
database systems implement their atomic commitment protocol in this model.

While this model might seem natural, handling of recovery complicates the im- 
plementation. For this reason, in the context of group communication, this model has 
mostly been considered in papers [21,22]. Practical issues, like application recovery, are 
not well defined in this model (in [23] the recovery is log based). Because of the access 
to stable storage, static crash recovery group communication systems can tolerate the 
simultaneous crash of all the processes [21].

Process classes. In one system model, processes do not recover after a crash. In the other 
model, processes may recover after a crash, and possibly crash again, etc. Altogether this 
leads us to consider three classes of processes: (1) green processes, which never crash, (2)
yellow processes, which might crash one or many times, but eventually stay forever up,
and (3) red processes, which either crash forever, or are unstable (they crash and recover
indefinitely). Figure 3 illustrates those three classes, along with the corresponding classes
described by Aguilera et al. [21]. Our terminology, with the distinction between green 
and yellow processes, fits better the needs of this paper. In the dynamic crash no-recovery 
model processes are either green or red. In the static crash recovery model, processes 
may also be yellow. 

Atomic Broadcast. We consider that the group communication component offers an
atomic broadcast primitive. Informally, atomic broadcast ensures that messages are
delivered in the same order by all destination processes. Formally, atomic broadcast is
defined by two primitives, A-broadcast and A-deliver, that satisfy the following
properties:

Validity: If a process A-delivers m, then m was A-broadcast by some process.

Uniform Agreement: If a process A-delivers a message m, then all non-red processes
eventually A-deliver m.

Uniform Integrity: For every message m, every process A-delivers m at most once.

Uniform Total Order: If two processes p and q A-deliver two messages m and m', then
p delivers m before m' if and only if q delivers m before m'.
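
Three of these properties can be checked mechanically on finite delivery logs, as the sketch below shows. The log representation (one list of message identifiers per process, in delivery order) and the function names are assumptions made for illustration. Uniform agreement is omitted because it is a liveness property that a finite log cannot establish.

```python
def validity(log, broadcast):
    """Every delivered message was A-broadcast by some process."""
    return all(m in broadcast for m in log)


def uniform_integrity(log):
    """No message is A-delivered more than once by a process."""
    return len(log) == len(set(log))


def uniform_total_order(log_p, log_q):
    """Messages delivered by both p and q appear in the same relative order.

    Assumes both logs already satisfy uniform integrity.
    """
    common = set(log_p) & set(log_q)
    return [m for m in log_p if m in common] == [m for m in log_q if m in common]


if __name__ == "__main__":
    print(uniform_total_order(["a", "b", "c"], ["a", "c"]))
    print(uniform_total_order(["a", "b", "c"], ["c", "a"]))
```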

In the following, we assume a system model where the atomic broadcast problem 
can be solved, e.g., the asynchronous system model with failure detectors [24,21], or the 
synchronous system model [25]. 

2.4 Inter-component Communications 

Inter-component communication, and more specifically communication between the 
group communication component and the application component, is usually done using 
function calls. This leads to problems in case of a crash, since a message might have 
been delivered by the group communication component, but the application might not 
have processed it. To address this issue, we express the communication between the 
group communication layer and the application layer as messages (Fig. 4). When the 
application executes A-send(m) (A stands for Atomic Broadcast), it sends the message
(m, A-send) to the group communication layer. To deliver message m to the application
(i.e., execute A-deliver(m)), the group communication component sends the message
(m, A-deliver) to the application.

So, we model the inter-component (intra-process) communication in the same way 
as inter-process communication. The main difference is that all components reside in 
the same process, and therefore fail together. This inter-layer communication is reliable 
(no message loss), except in case of a crash.

Fig. 5. Unrecoverable failure scenario



3 Group Communication-Based Database Replication Is Not 2-Safe

In this section, we show that traditional group communication systems cannot be used 
to implement 2-safe replication. There are two reasons for this. The first problem that 
arises when trying to build a 2-safe system is the number of crashes the system can
tolerate. The 2-safety criterion imposes no bounds on the number of servers that may 
crash, but the dynamic crash no-recovery model does not tolerate the crash of all servers. 
This issue can be addressed by relying on the static crash recovery model. 

The second problem is not linked to the model, but related to message delivery and 
recovery procedures. The core problem lies in the fact that the delivery of a message does 
not ensure the processing of that message [26]. Ignoring this fact can lead to incorrect 
recovery protocols [27]. Note that this second problem exists in all group communication 
toolkits, which rely on the state transfer mechanism for recovery regardless of the model 
they are implemented in. 

To illustrate this problem, consider the scenario illustrated in Fig. 5. Transaction t is
submitted on the delegate server S_d. When t terminates, S_d sends a message m containing
t to all replicas. The message m is sent using an atomic broadcast. The delegate S_d
delivers m, the local database component locally logs and commits t, and confirms the
commit to the client: transaction t is committed in the database component of S_d. Then
S_d crashes. All other replicas (S_2 and S_3) deliver m, i.e., the group communication
components of S_1, S_2 and S_3 have done their job. Finally S_2 and S_3 crash (before
committing t), and later recover (before S_d).

The system cannot rebuild a consistent state that includes t’s changes. Servers S2 
and S3 recover to the state of the database D that does not include the execution of t. 
Message m that contained t is not kept in any group communication component (it was 
delivered everywhere) and t was neither committed nor logged on servers S2 and S3: 
the technique is not 2-safe. 

In this replication scheme, when a client is notified of the commit of transaction t, 
the only guarantee is that t was committed by the delegate Sd. The use of group 
communication does not ensure that t will commit on the other servers, but merely that the 
message m containing t will be delivered on all servers in the view. If those servers 
crash after the time of m’s delivery and before t is actually committed or logged to disk, 
then transaction t is lost. In the scenario of Fig. 5, if the recovery is based on the state 
transfer mechanism (Sect. 2.3), there is no available server that has a state containing t’s 
changes. If recovery is log-based (Sect. 2.3), the group communication system cannot 
deliver message m again without violating the uniform integrity property (m cannot be 
delivered twice). 

The problem lies in the lack of end-to-end guarantees of group communication 
systems described by Cheriton and Skeen [28] and is related to the fact that message 
delivery is not an atomic event. Group communication systems enforce guarantees on 
the delivery of messages to the application, but offer no guarantees with respect to the 
application level: 2-safety is an application level guarantee. 

4 Group Communication with End-to-End Guarantees for 2-Safe 
Replication 

We have shown in the previous section that it is impossible to implement a 2-safe database 
replication technique using a group communication toolkit that offers a traditional atomic 
broadcast. In order to build a 2-safe replication technique, we need to address the end- 
to-end issue. 

4.1 Ad-hoc Solution 

One way to solve the problem would be to add more messages to the protocol: for 
instance, each server could send a message signalling that t was effectively logged and will 
eventually commit. The delegate Sd would confirm the commit to the client after 
receiving those messages. This approach has been proposed by Keidar et al. for implementing a 
group communication toolkit on top of another group communication toolkit [29]. While 
such an approach would work, it has two drawbacks. First, the technique would have a 
higher latency because of the additional waiting: synchronisation between replicas is 
expensive [5]. But most importantly, this approach ruins the modularity of the 
architecture. The point of using a group communication system is to have all complex network 
protocols implemented by the group communication component and not to clutter the 
application with communication issues. If additional distributed protocols are imple- 
mented in an ad-hoc fashion inside the application, they risk being less efficient (some 
functionality of the group communication will be duplicated inside the application, in 
this case, acknowledgement messages), and less reliable (distributed system protocols 
tend to be complex; when implemented in an ad-hoc fashion, they might be incorrect). 

4.2 End-to-End Atomic Broadcast 

The problem of lost transactions appears when a crash occurs between the time a message 
is delivered and the time it is processed by the application. When a message is delivered 
to the application and the application is able to process the message, we say that the 
delivery of the message is successful. However, we cannot realistically prevent servers 
from crashing during the time interval between delivery and successful delivery. In the 
event of a crash, messages that were not successfully delivered must be delivered again: 
we have to make sure that all messages are eventually delivered successfully. 

Fig. 6. Message exchange with successful delivery 



With current group communication primitives, there is no provision for 
specifying successful delivery. For this reason, we introduce a new inter-component message 
that acknowledges the end of processing of m (i.e., successful delivery of m). We 
denote this message ack(m). The mechanism is similar to acknowledgement messages 
used in inter-process communications. Figure 6 shows the exchange of messages for 
an atomic broadcast. First, the application sends message m, represented by the 
inter-component message (m, A-send), to the group communication system. When the group 
communication component is about to deliver m, it sends the inter-component 
message (m, A-deliver). Once the application has processed message m, it sends the 
inter-component message (m, ack) to signal that m is successfully delivered. 

If a crash occurs, and the group communication component did not receive the 
message (m, ack), then (m, A-deliver) should be sent again to the application upon recovery. 
This requires the group communication component to log messages and to use log-based 
recovery (instead of checkpoint-based recovery). So after each crash, the group 
communication component “replays” all messages m such that (m, ack) was not received 
from the application. By replaying messages, the group communication component 
ensures that, if the process is eventually forever up, i.e., non-red, then all messages will 
eventually be successfully delivered. 

We call the new primitive end-to-end atomic broadcast. The specification of 
end-to-end atomic broadcast is similar to the specification of atomic broadcast in Sect. 2.3, 
except for (1) a new end-to-end property, and (2) a refined uniform integrity property: 
a message m might be delivered multiple times, but can only be delivered successfully 
once. A message m is said to be successfully delivered when ack(m) is received. The new 
properties are the following: 

End-to-End: If a non-red process A-delivers a message m, then it eventually 
successfully A-delivers m. 

Uniform Integrity: For every message m, every process successfully A-delivers m at 
most once. 

We assume a well-behaved application, that is, when the application receives message 
(m, A-deliver) from the group communication component, it sends (m, ack) as soon as 
possible. 
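
The replay mechanism described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `EndToEndChannel`, the exception `AppCrash`, the in-memory dictionary standing in for the stable log of the group communication component, and the callback interface are all hypothetical. Ordering and inter-process communication are left out; only the inter-component messages (m, A-deliver) and (m, ack) of a single process are modelled, and a normal return of the callback plays the role of (m, ack).

```python
class AppCrash(Exception):
    """Raised by the toy application to simulate a crash during processing."""


class EndToEndChannel:
    """Toy model of the group communication component of one process.

    Assumption: `self.log` and `self.acked` stand in for stable storage,
    i.e. they survive a crash of the process.
    """

    def __init__(self, on_deliver):
        self.on_deliver = on_deliver  # application callback for (m, A-deliver)
        self.log = {}                 # msg_id -> payload, in delivery order
        self.acked = set()            # msg_ids for which (m, ack) was received

    def a_deliver(self, msg_id, payload):
        # Log first, so the message can be replayed if the ack never arrives.
        self.log.setdefault(msg_id, payload)
        self._hand_over(msg_id)

    def _hand_over(self, msg_id):
        if msg_id in self.acked:      # successfully delivered at most once
            return
        try:
            self.on_deliver(msg_id, self.log[msg_id])
        except AppCrash:
            return                    # no ack: the message stays pending
        self.acked.add(msg_id)        # normal return models (m, ack)

    def recover(self):
        # Replay every logged message whose ack was never received.
        for msg_id in list(self.log):
            self._hand_over(msg_id)


# Usage: the application crashes on the first delivery of m, then recovers.
processed = []
crash_pending = [True]

def app(msg_id, payload):
    if crash_pending[0]:
        crash_pending[0] = False
        raise AppCrash()
    processed.append(payload)

chan = EndToEndChannel(app)
chan.a_deliver("m", "t")   # delivered, but not successfully
chan.recover()             # replayed, now successfully delivered
```

In this sketch a message may reach the application several times, but it is recorded as successfully delivered only once, which mirrors the refined uniform integrity property.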

4.3 2-Safe Database Replication Using End-to-End Atomic Broadcast 

2-safe database replication can be built using end-to-end atomic broadcast. The 
replication technique uses the end-to-end atomic broadcast instead of the “classical” atomic 
broadcast. The only difference is that the replication technique must signal successful 
delivery, i.e., generate ack(m). This happens when the transaction t contained in m 
is logged and is therefore guaranteed to commit. According to the specification of the 
end-to-end atomic broadcast primitive, every non-red process eventually successfully 
delivers m. The testable transaction abstraction described in Sect. 2.2 ensures that a 
transaction is committed at most once. So every process that is not permanently crashed 
or unstable eventually commits t exactly once: the technique is 2-safe. 

Fig. 7. Recovery with end-to-end atomic broadcast 

Figure 7 shows the scenario of Fig. 5 using end-to-end atomic broadcast. After the 
recovery of servers S2 and S3, message m is delivered again. This time, S2 and S3 do 
not crash, the delivery of m is successful and t is committed on all available servers. 
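
Because a message can be handed to the application more than once, the database component has to make repeated deliveries harmless. The sketch below illustrates this on the application side; it is an assumption-laden toy and not the testable transaction abstraction of Sect. 2.2 itself. The class `Replica`, its fields, and the use of increments as writes are hypothetical, and the dictionaries merely stand in for the on-disk log and database.

```python
class Replica:
    """Toy database component of one server."""

    def __init__(self):
        self.stable_log = {}    # txn_id -> writes; models the on-disk log
        self.committed = set()  # txn_ids whose writes were already applied
        self.db = {}            # item -> value

    def on_a_deliver(self, txn_id, writes):
        """Handle (m, A-deliver); returning True models sending (m, ack)."""
        if txn_id not in self.stable_log:
            self.stable_log[txn_id] = dict(writes)  # log before acknowledging
        # For brevity the commit happens in the same call; the ack only
        # requires that the transaction has been logged.
        self.commit(txn_id)
        return True

    def commit(self, txn_id):
        if txn_id in self.committed:  # commit at most once
            return
        for item, delta in self.stable_log[txn_id].items():
            self.db[item] = self.db.get(item, 0) + delta
        self.committed.add(txn_id)


replica = Replica()
replica.on_a_deliver("t", {"x": 5})
replica.on_a_deliver("t", {"x": 5})  # replay after a crash: no double update
```

Increments are used on purpose: they are not idempotent, so the at-most-once check is what keeps the replayed delivery from changing the state a second time.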

5 A New Safety Criterion: Group-Safety 

We have shown in Sect. 3 that the techniques of Sect. 1 based on traditional group 
communication are not 2-safe. They are only 1-safe: when the client is notified of t’s 
commit, t did commit on the delegate server. As shown in Sect. 4, 2-safety can be ob- 
tained by extending group communication with end-to-end guarantees. However, group 
communication without end-to-end guarantees, even though it does not ensure 2-safety, 
provides an additional guarantee that is orthogonal to 1-safety and 2-safety. We call this 
guarantee group-safety. 

5.1 Group Safety 

A replication technique is group-safe if, when a client receives confirmation of a trans- 
action’s commit, the message that contains the transaction is guaranteed to be delivered 
(but not necessarily processed) on all available servers. In contrast, 2-safety guarantees 
that the transaction will be processed (i.e., logged) on all available servers. Group-safety 
relies on the group of servers to ensure durability, whereas 2-safety relies on stable 
storage. With group-safety, if the group does not fail, i.e., enough servers remain up, then 
durability is ensured (the number of servers depends on the system model and the 
algorithm used; typically a majority of the servers must stay up). Notice that group-safety 
does not guarantee that the transaction was logged or committed on any replica. A client 
might be notified of the termination of some transaction t before t was actually logged 
on any replica. 

Table 1. Summary of different safety levels 

                            Transaction logged on 
Transaction delivered on    No replica            1 replica               All replicas 
1 replica                   No safety (0-safe)    1-safe                  (grayed out) 
All replicas                Group-safe            Group-safe & 1-safe     2-safe 
                                                  (group-1-safe) 


The relationship between group-safety, 1-safety and 2-safety is summarised in 
Table 1. We use two criteria: (1) the number of servers that are guaranteed to deliver the 
(message containing the) transaction (vertical axis), and (2) the number of servers that 
are guaranteed to eventually commit the transaction (horizontal axis), that is the number 
of servers that have logged the transaction. We distinguish a transaction delivered on 
(one, all) replicas, and a transaction logged on (none, one, all) replicas. A transaction 
cannot be logged on a site before being delivered, so the corresponding entry in the table 
is grayed out. For each remaining entry in the table the corresponding safety level is 
indicated: 



No Safety: The client is notified as soon as the transaction t is delivered on one server 
Sd (t did not yet commit). No safety is enforced. If Sd crashes before t’s writes are 
flushed to stable storage, then t is lost. We call this 0-safe replication. 

1-Safe: With 1-safety, the client is notified when transaction t is delivered and logged on 
one server only, the delegate server Sd. If Sd crashes, then t might get lost. Indeed, 
while Sd is down, the system might commit new transactions that conflict with t: 
t must be discarded when Sd recovers. The only alternative would be to block all 
new transactions while Sd is down [30]. 

Group-Safe: The client is notified when a transaction is guaranteed to be delivered 
on all available servers (but might not be logged on any servers). If the group 
fails because too many servers crash, then t might be lost. Group-safe replication 
basically allows all disk writes to be done asynchronously (outside of the scope 
of the transaction) thus enabling optimisations like write caching. Typically, disk 
writes would not be done immediately, but periodically. Writes of adjacent pages 
would also be scheduled together to maximise disk throughput. 

Group-Safe & 1-Safe: The client is notified when transaction t is guaranteed to be 
delivered on all servers and was logged on one server, the delegate Sd, and thus will 
eventually commit on Sd. Since the system is both group-safe and 1-safe, we call 
this safety level group-1-safety. With group-1-safety, the transaction might be lost if 
too many servers, including Sd, crash. A transaction loss occurs either if Sd never 
recovers, or the system accepts conflicting transactions while Sd is crashed [30]. 
Most proposed database replication strategies based on group communication fall 
in this category [31,10,32,33,34]. 

Table 2. Safety property and number of crashes 

Tolerated number of crashes    Safety property 
0 crashes                      0-safe, 1-safe 
Less than n crashes            group-safe, group-1-safe 
n crashes                      2-safe 



Table 3. Comparison between group-safety and group-1-safety 

                Group does not fail    Group fails,          Group fails, 
                                       Sd does not crash     Sd crashes 
Group-safe      no loss                possible loss         possible loss 
Group-1-safe    no loss                no loss               possible loss 



2-Safe: The client is notified when a transaction is logged on all available servers. Even 
if all servers crash, the transaction will eventually commit and therefore cannot get 
lost. 

If we consider the number of crashes that can be tolerated, we have basically three 
safety levels (Table 2): a) 0-safe and 1-safe replication cannot tolerate any crash, i.e., 
a single crash can lead to the loss of a transaction, b) group-safe and group-1-safe replication 
cannot tolerate the crash of all n servers, c) 2-safe replication can tolerate the crash of all n servers. 
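
The classification of Table 1 and Table 2 can be written down as a small lookup. This is only a restatement of the two tables in code; the function name `safety_level`, its parameters, and the dictionary `TOLERATED_CRASHES` are hypothetical names introduced for the illustration.

```python
def safety_level(delivered_on_all: bool, logged_on: str) -> str:
    """Return the safety level reached when the client is notified.

    delivered_on_all: True if delivery is guaranteed on all available
                      servers, False if only on the delegate server.
    logged_on:        "none", "one" (the delegate) or "all".
    """
    levels = {
        (False, "none"): "0-safe",
        (False, "one"): "1-safe",
        (True, "none"): "group-safe",
        (True, "one"): "group-1-safe",
        (True, "all"): "2-safe",
    }
    key = (delivered_on_all, logged_on)
    if key not in levels:
        # For example (False, "all"): a transaction cannot be logged on a
        # site before being delivered there (the grayed-out table entry).
        raise ValueError(f"no safety level for combination {key}")
    return levels[key]


# Number of crashes each level tolerates without losing a transaction
# (Table 2); n is the total number of servers.
TOLERATED_CRASHES = {
    "0-safe": "0 crashes",
    "1-safe": "0 crashes",
    "group-safe": "less than n crashes",
    "group-1-safe": "less than n crashes",
    "2-safe": "n crashes",
}
```

For instance, `safety_level(True, "one")` yields `"group-1-safe"`, the level of most group communication-based techniques mentioned above.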

5.2 Group Safety Is Preferable to Group-1-Safety 

Group-safe as well as group-1-safe replication techniques cannot tolerate the crash of 
all servers. So, what is the real difference between both criteria? Table 3 summarises the 
conditions that lead to the loss of the transaction, using two criteria: (1) failure of the 
group (typically failure of a majority of servers) and (2) crash of the delegate server Sd. 
The difference appears in the middle column (failure of the group, but not of Sd). 

Group communication-based replication schemes are especially interesting in 
update-everywhere settings, where the strong properties of atomic broadcast are used to handle 
concurrent transactions. If the replication is update-everywhere, then any of the servers S1 ... Sn 
might be the delegate server for some transaction. 3 If the group fails, at least one server 
crashed, and this server might be the delegate server Sd for some transaction t. In this 
case, the middle column of Table 3 does not exist. In such settings it makes little sense 
to deploy a group-1-safe replication technique. It must be noted that switching between 
group-1-safe and group-safe can be done easily at runtime: an actual implementation 
might choose to switch between both modes depending on the situation. 

The replication technique illustrated in Fig. 2 ensures group-1-safety. It can be 
transformed into a group-safe-only technique quite easily. Figure 8 illustrates the group-safe 
version of the same technique. Read operations are typically done only on the delegate 
server Sd before the broadcast; writes are executed once the transaction is delivered by the 
atomic broadcast. The main difference with Fig. 2 is the response to the client, which 
is sent back as soon as the transaction is delivered by the atomic broadcast and the 
decision to commit/abort the transaction is known. The response time observed by the 
client is shortened by the time needed to write the decision to disk. The performance 
gain is shown by the simulation results presented in Sect. 6. 

3 This is not the case with the primary-copy technique. 

Table 4. Simulator parameters 

Parameter                                           Value 
Number of items in the database                     10,000 
Number of servers                                   9 
Number of clients per server                        4 
Disks per server                                    2 
CPUs per server                                     2 
Transaction length                                  10-20 operations 
Probability that an operation is a write            50% 
Probability that an operation is a query            50% 
Buffer hit ratio                                    20% 
Time for a read                                     4-12 ms 
Time for a write                                    4-12 ms 
CPU time used for an I/O operation                  0.4 ms 
Time for a message or a broadcast on the network    0.07 ms 
CPU time for a network operation                    0.07 ms 



6 Performance Evaluation 

In this section we compare the performance of group-safety, group-1-safety and 1-safety 
(i.e., lazy replication). The evaluation is done using a replicated database simulator [5]. 
The group communication-based technique is the database state machine technique [10], 
which is an instance of the replication technique illustrated in Fig. 2 (for group-1-safety) 
and Fig. 8 (for group-safety). The settings of the simulator are described in Table 4. The 
load of the system is between 20 and 40 transactions per second; the network settings 
correspond to a 100 Mb/s LAN. All three techniques used the same logging setting, so 
they share the same throughput limits. 

Figure 9 shows the results of this experiment. The X axis represents the load of the 
system in transactions per second, the Y axis the response time, in milliseconds. 

Each replication technique is represented by one curve. The results show that 
group-safe replication has very good performance: it even outperforms lazy replication when 
the load is below 38 transactions per second. The abort rate of the group-safe technique 
was constant, slightly below 7%. As the lazy technique does no conflict handling, its abort 
rate is unknown. The very good performance of the group-safe technique is due to the 
asynchrony of the writes (the writes to disk are done outside the scope of the transaction). 

Fig. 9. Simulation results 

Fig. 10. Group-safe replication and lazy replication 

In high-load situations, group-safe replication becomes less efficient than lazy 
replication. The results also show that group-1-safe replication behaves significantly worse 
than group-safe replication: the technique scales poorly when the load increases. 

To summarise, the results show that transferring the responsibility of durability from 
stable storage to the group is a good idea in a LAN: in our setting, writing to disk takes 
around 8 ms, while performing an atomic broadcast takes approximately 1 ms. 



7 Group-Safe Replication vs. Lazy Replication 

On a conceptual level, group-safe replication can be seen as a complement to lazy 
replication. Both approaches try to get better performance by weakening the link between 
some parts of the system. Figure 10 illustrates this relationship. Group-safe replication 
relaxes the link between server and stable storage: when a transaction t commits, the 
state in memory and in stable storage might be different (t’s writes are not committed 
to disk, they are done asynchronously). 

Lazy replication relaxes the link between replicas: when a transaction commits, the 
state in the different replicas might be different (some replicas have not seen transaction t; 
t’s writes are sent asynchronously). The two approaches relax the synchrony that is 
deemed too expensive. 

The main difference is the condition that leads to a violation of the ACID properties. 
In an update-everywhere setting, a lazy technique can violate the ACID properties even 
if no failure occurs. On the other hand, group-safe replication will only violate the 
ACID properties if the group fails (too many servers crash). Group-safe replication has 
another advantage over lazy replication. With lazy replication in an update-everywhere 
setting, if the number of servers grows, the chances that two transactions originating from 
two different sites conflict grow. So the chances that the ACID properties are violated 
grow with the number of servers. 

With group-safe replication the ACID properties might get violated if too many 
servers crash. If we assume that the probability of the crash of a server is independent 
of the number of servers, the chance of violating the ACID properties decreases when 
the number of servers increases. So, the chances that something bad happens increase 
with n for lazy replication, and decrease with n for group-safe replication. 
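
The argument for group-safe replication can be checked numerically. The sketch below is an illustration under assumptions that the text does not spell out: the group is taken to fail when no majority of the n servers stays up, each server is down independently with the same probability p at the instant considered, and p is below 0.5. The function name `group_failure_probability` is hypothetical.

```python
from math import comb


def group_failure_probability(n: int, p: float) -> float:
    """Probability that no majority of n servers is up, when each server
    is down independently with probability p."""
    # A majority is up only if fewer than n/2 servers are down, so the
    # group fails as soon as n - n // 2 servers are down.
    min_crashes = n - n // 2
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(min_crashes, n + 1)
    )


# Odd group sizes only: going from an odd n to n + 1 adds a server without
# raising the number of crashes the majority rule tolerates.
for n in (3, 5, 7, 9):
    print(n, group_failure_probability(n, 0.1))
```

With p = 0.1, the value for n = 3 is 0.028 (3 · 0.01 · 0.9 + 0.001), and it shrinks as n grows through the odd sizes, which matches the qualitative claim made above.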



8 Related Work 

As already mentioned, traditional database replication is usually either (i) 2-safe and 
built around an atomic commitment protocol like 2PC, or (ii) does not rely on atomic 
commitment and is therefore 1-safe [12]. As the atomic commitment protocol is 
implemented inside the database system, coupling between database and communica- 
tion systems is not an issue. Techniques to improve atomic commitment using group 
communication have also been proposed [35,36,37]. 

The fact that 2-safety does not require atomic commitment has been hinted at in [38]. 
The paper explores the relationship between safety levels and the semantics of commu- 
nication protocols. However, the distinction between 2-safety and the safety properties 
ensured by traditional group communication does not appear explicitly in [38]. 

While the notion of group safety is formally defined here, existing database 
replication protocols have in the past relied on this property, e.g., [39,27]. The trade-off between 
2-safety and group-safety has never been presented before. 

The COReL toolkit [29] is a group communication toolkit that is built on top of 
another group communication toolkit. The COReL toolkit has to cope with the absence 
of end-to-end guarantees of the underlying toolkit. This issue is addressed by logging 
incoming messages and sending explicit acknowledgement messages on the network. 
However, an application built on top of COReL will not get end-to-end guarantees. 

The issue of end-to-end properties for applications is mentioned in [40]. While the 
proposed solution solves the problem of partitions and partition healing, the issue of 
synchronisation between application and group communication toolkit is not discussed. 
In general, partitionable group membership systems solve some issues raised in this 
paper: failure of the group communication because of crashes and partitions [41]. Yet, 
the issue of application recovery after a crash is not handled. 

The Disk Paxos [22] algorithm can also be loosely related to 2-safety, even though 
the paper does not address database replication issues. The paper presents an original 
way, using stable storage, to couple the application component with a component solving 
an agreement problem. However, the paper assumes a network attached storage, which 
is quite different from the model considered here, where each network node only has 
direct access to its own database. 

The issue of connecting the group communication component and the database com- 
ponent can also be related to the exactly once property in three-tier applications [13]. 
In our case, group communication system and database system can be seen as two tiers 
co-located on the same machine that communicate using messages. 

9 Conclusion 

In this paper, we have shown that traditional group communication primitives are not 
suited for building 2-safe database replication techniques. This led us to introduce 
end-to-end atomic broadcast to solve the problem. We have also shown that, while traditional 
group communication primitives (without end-to-end guarantees) are not suited for 2-safe 
replication, they offer stronger guarantees than 1-safety. To formalise this, we have introduced 
a new safety criterion, called group-safety, that captures the safety guarantees of group 
communication-based replication techniques. While this safety criterion is natural in 
distributed systems, it is less so in the replicated database context. The performance 
evaluation shows that group-safe replication compares favourably to lazy replication, while 
providing better guarantees in terms of the ACID properties. 

Abstract. In recent years, privacy preserving data mining has become 
an important problem because of the large amount of personal data 
which is tracked by many business applications. In many cases, users are 
unwilling to provide personal information unless the privacy of sensitive 
information is guaranteed. In this paper, we propose a new framework for 
privacy preserving data mining of multi-dimensional data. Previous work 
for privacy preserving data mining uses a perturbation approach which 
reconstructs data distributions in order to perform the mining. Such an 
approach treats each dimension independently and therefore ignores the 
correlations between the different dimensions. In addition, it requires the 
development of a new distribution based algorithm for each data mining 
problem, since it does not use the multi-dimensional records, but uses 
aggregate distributions of the data as input. This leads to a fundamental 
re-design of data mining algorithms. In this paper, we will develop a new 
and flexible approach for privacy preserving data mining which does 
not require new problem-specific algorithms, since it maps the original 
data set into a new anonymized data set. This anonymized data closely 
matches the characteristics of the original data including the correlations 
among the different dimensions. We present empirical results illustrating 
the effectiveness of the method. 



1 Introduction 

Privacy preserving data mining has become an important problem in recent 
years, because of the large amount of consumer data tracked by automated sys- 
tems on the internet. The proliferation of electronic commerce on the world wide 
web has resulted in the storage of large amounts of transactional and personal 
information about users. In addition, advances in hardware technology have also 
made it feasible to track information about individuals from transactions in ev- 
eryday life. For example, a simple transaction such as using the credit card results 
in automated storage of information about user buying behavior. In many cases, 
users are not willing to supply such personal data unless its privacy is guaran- 
teed. Therefore, in order to ensure effective data collection, it is important to 
design methods which can mine the data with a guarantee of privacy. This has 
resulted in a considerable amount of focus on privacy preserving data collection 
and mining methods in recent years [1], [2], [3], [4], [6], [8], [9], [12], [13]. 

A perturbation based approach to privacy preserving data mining was 
pioneered in [1]. This technique relies on two facts: 

— Users are not equally protective of all values in the records. Thus, users 
may be willing to provide modified values of certain fields by the use of a 
(publicly known) perturbing random distribution. This modified value may 
be generated using custom code or a browser plug-in. 

— Data mining problems do not necessarily require the individual records, 
but only distributions. Since the perturbing distribution is known, it can 
be used to reconstruct aggregate distributions. This aggregate information 
may be used for the purpose of data mining algorithms. An example of a 
classification algorithm which uses such aggregate information is discussed 
in [1]. 

Specifically, let us consider a set of n original data values $x_1, \ldots, x_n$. These are 
modelled in [1] as n independent values drawn from the data distribution X. 
In order to create the perturbation, we generate n independent values $y_1, \ldots, y_n$, 
each with the same distribution as the random variable Y. Thus, the perturbed 
values of the data are given by $x_1 + y_1, \ldots, x_n + y_n$. Given these values, and the 
(publicly known) density distribution $f_Y$ for Y, techniques have been proposed 
in [1] in order to estimate the distribution $f_X$ for X. An iterative algorithm has 
been proposed in the same work in order to estimate the data distribution $f_X$. 
A convergence result was proved in [2] for a refinement of this algorithm. In 
addition, the paper in [2] provides a framework for effective quantification of the 
effectiveness of a (perturbation-based) privacy preserving data mining approach. 
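
The idea that aggregate properties of X remain recoverable from the perturbed values can be illustrated with a deliberately simplified sketch. It is not the iterative reconstruction algorithm of [1]: instead of the full density, it recovers only the mean and variance of X, using the known moments of the noise. The uniform noise distribution, the function names `perturb` and `estimate_moments`, and all numeric parameters are assumptions chosen for the example.

```python
import random
from statistics import fmean, pvariance


def perturb(values, half_width, rng):
    """Release x_i + y_i, with y_i drawn from Uniform(-half_width, half_width)."""
    return [x + rng.uniform(-half_width, half_width) for x in values]


def estimate_moments(released, half_width):
    """Estimate the mean and variance of X from the perturbed values only.

    The noise distribution is public: it has mean 0 and variance
    half_width**2 / 3, and it is independent of X, so
    Var(X + Y) = Var(X) + Var(Y).
    """
    noise_variance = half_width**2 / 3
    return fmean(released), pvariance(released) - noise_variance


rng = random.Random(0)                                    # fixed seed, for repeatability
original = [rng.gauss(50.0, 10.0) for _ in range(20000)]  # hypothetical private values
released = perturb(original, 20.0, rng)
est_mean, est_var = estimate_moments(released, 20.0)
```

Each released value can be off by up to 20 from the original value, yet the estimates stay close to the true mean 50 and variance 100 because the noise averages out over many records. A wider noise interval gives more privacy per record and noisier estimates, which is the trade-off discussed in the text.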

We note that the perturbation approach results in some amount of informa- 
tion loss. The greater the level of perturbation, the less likely it is that we will 
be able to estimate the data distributions effectively. On the other hand, larger 
perturbations also lead to a greater amount of privacy. Thus, there is a natural 
trade-off between greater accuracy and loss of privacy. 

Another interesting method for privacy preserving data mining is the 
k-anonymity model [18]. In the k-anonymity model, domain generalization 
hierarchies are used in order to transform and replace each record value with a 
corresponding generalized value. We note that the choice of the best 
generalization hierarchy and strategy in the k-anonymity model is highly specific to a 
particular application, and is in fact dependent upon the user or domain expert. 
In many applications and data sets, it may be difficult to obtain such precise 
domain specific feedback. On the other hand, the perturbation technique [1] does 
not require the use of such information. Thus, the perturbation model has a 
number of advantages over the k-anonymity model because of its independence 
from domain specific considerations. 

The perturbation approach works under the strong requirement that the data 
set forming server is not allowed to learn or recover precise records. This strong 
restriction naturally also leads to some weaknesses. Since the former method does 
not reconstruct the original data values but only distributions, new algorithms 
need to be developed which use these reconstructed distributions in order to 
perform mining of the underlying data. This means that for each individual 
data problem such as classification, clustering, or association rule mining, a new 
distribution based data mining algorithm needs to be developed. For example, 
the work in [1] develops a new distribution based data mining algorithm for the 
classification problem, whereas the techniques in [9] and [16] develop methods for 
privacy preserving association rule mining. While some clever approaches have 
been developed for distribution based mining of data for particular problems 
such as association rules and classification, it is clear that using distributions 
instead of original records greatly restricts the range of algorithmic techniques 
that can be used on the data. Aside from the additional inaccuracies resulting 
from the perturbation itself, this restriction can itself lead to a reduction of the 
level of effectiveness with which different data mining techniques can be applied. 

In the perturbation approach, the distribution of each data dimension is re- 
constructed 1 independently. This means that any distribution based data min- 
ing algorithm works under an implicit assumption of treating each dimension 
independently. In many cases, a lot of relevant information for data mining al- 
gorithms such as classification is hidden in the inter-attribute correlations [14]. 
For example, the classification technique in [1] uses a distribution-based ana- 
logue of a single-attribute split algorithm. However, other techniques such as 
multi-variate decision tree algorithms [14] cannot be accordingly modified to 
work with the perturbation approach. This is because of the independent treat- 
ment of the different attributes by the perturbation approach. This means that 
distribution based data mining algorithms have an inherent disadvantage of loss 
of implicit information available in multi-dimensional records. It is not easy to 
extend the technique in [1] to reconstruct multi-variate distributions, because 
the amount of data required to estimate multi-dimensional distributions (even 
without randomization) increases exponentially² with data dimensionality [17]. 
This is often not feasible in many practical problems because of the large number 
of dimensions in the data. 

The perturbation approach also does not provide a clear understanding of 
the level of indistinguishability of different records. For example, for a given level 
of perturbation, how do we know the level to which it distinguishes the different 
records effectively? While the k-anonymity model provides such guarantees, it 
requires the use of domain generalization hierarchies, which constrain 
its effective use over arbitrary data sets. As in the k-anonymity model, 
we use an approach in which a record cannot be distinguished from at least 
k other records in the data. The approach discussed in this paper requires the 
comparison of a current set of records with the current set of summary statistics. 
Thus, it requires a relaxation of the strong assumption of [1] that the data set 



¹ Both the local and global reconstruction methods treat each dimension 
independently. 

² A limited level of multi-variate randomization and reconstruction is possible in sparse 
categorical data sets such as the market basket problem [9]. However, this specialized 
form of randomization cannot be effectively applied to generic non-sparse data sets 
because of the theoretical considerations discussed. 
forming server is not allowed to learn or recover records. However, only aggregate 
statistics are stored or used during the data mining process at the server end. 

A record is said to be k-indistinguishable when there are at least k other 
records in the data from which it cannot be distinguished. The approach in 
this paper re-generates the anonymized records from the data using the above 
considerations. The approach can be applied to either static data sets, or more 
dynamic data sets in which data points are added incrementally. Our method 
has two advantages over the k-anonymity model: 

(1) It does not require the use of domain generalization hierarchies as in the 
k-anonymity model. 

(2) It can be effectively used in situations with dynamic data updates such as the 
data stream problem. This is not the case for the work in [18], which essentially 
assumes that the entire data set is available a priori. 

This paper is organized as follows. In the next section, we will introduce the 
locality sensitive condensation approach. We will first discuss the simple case 
in which an entire data set is available for application of the privacy preserving 
approach. This approach will be extended to incrementally updated data sets 
in section 3. The empirical results are discussed in section 4. Finally, section 5 
contains the conclusions and summary. 



2 The Condensation Approach 

In this section, we will discuss a condensation approach for data mining. This 
approach uses a methodology which condenses the data into multiple groups of 
pre-defined size. For each group, a certain level of statistical information about 
different records is maintained. This statistical information suffices to preserve 
the mean and the correlations across the different dimensions. 
Within a group, it is not possible to distinguish different records from 
one another. Each group has a certain minimum size k, which is referred to as 
the indistinguishability level of that privacy preserving approach. The greater 
the indistinguishability level, the greater the amount of privacy. At the same 
time, a greater amount of information is lost because of the condensation of a 
larger number of records into a single statistical group entity. 

Each group of records is referred to as a condensed unit. Let $\mathcal{G}$ be a condensed 
group containing the records $\{X_1 \ldots X_k\}$. Let us also assume that each record 
$X_i$ contains the $d$ dimensions which are denoted by $(x_i^1 \ldots x_i^d)$. The following 
information is maintained about each group of records $\mathcal{G}$: 

— For each attribute $j$, we maintain the sum of corresponding values. The 
corresponding value is given by $\sum_{i=1}^{k} x_i^j$. We denote the corresponding 
first-order sums by $Fs_j(\mathcal{G})$. The vector of first order sums is denoted by $Fs(\mathcal{G})$. 

— For each pair of attributes $i$ and $j$, we maintain the sum of the product of 
corresponding attribute values. This sum is equal to $\sum_{t=1}^{k} x_t^i \cdot x_t^j$. We denote 
the corresponding second order sums by $Sc_{ij}(\mathcal{G})$. The vector of second order 
sums is denoted by $Sc(\mathcal{G})$. 

— We maintain the total number of records $k$ in that group. This number is 
denoted by $n(\mathcal{G})$. 

We make the following simple observations: 

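Observation 1: The mean value of attribute $j$ in group $\mathcal{G}$ is given by 
$Fs_j(\mathcal{G})/n(\mathcal{G})$. 

Observation 2: The covariance between attributes $i$ and $j$ in group $\mathcal{G}$ is given 
by $Sc_{ij}(\mathcal{G})/n(\mathcal{G}) - Fs_i(\mathcal{G}) \cdot Fs_j(\mathcal{G})/n(\mathcal{G})^2$. 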
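The two observations can be checked on a toy group. The following is a minimal sketch, assuming records are plain tuples of floats; the function names and the sample values are illustrative and do not come from the paper.

```python
def group_statistics(records):
    """Return (Fs, Sc, n): first-order sums, second-order sums, record count."""
    n = len(records)
    d = len(records[0])
    fs = [sum(x[j] for x in records) for j in range(d)]
    sc = [[sum(x[i] * x[j] for x in records) for j in range(d)] for i in range(d)]
    return fs, sc, n

def mean_and_covariance(fs, sc, n):
    """Apply Observation 1 (mean) and Observation 2 (covariance) to the sums."""
    d = len(fs)
    mean = [fs[j] / n for j in range(d)]
    cov = [[sc[i][j] / n - fs[i] * fs[j] / n ** 2 for j in range(d)]
           for i in range(d)]
    return mean, cov

# Hypothetical group of three two-dimensional records.
group = [(1.0, 2.0), (3.0, 6.0), (5.0, 10.0)]
fs, sc, n = group_statistics(group)
mean, cov = mean_and_covariance(fs, sc, n)
print(mean)
print(cov)
```

Only `fs`, `sc` and `n` need to be retained; the mean and covariance are recomputed from them whenever they are required.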

The method of group construction is different depending upon whether an 
entire database of records is available or whether the data records arrive in an 
incremental fashion. We will discuss two approaches for construction of class 
statistics: 

— When the entire data set is available and individual subgroups need to be 
created from it. 

— When the data records need to be added incrementally to the individual 
subgroups. 

The algorithm for creation of subgroups from the entire data set is a 
straightforward iterative approach. In each iteration, a record $X$ is sampled from the 
database $\mathcal{D}$. The closest $(k-1)$ records to this individual record $X$ are added 
to this group. Let us denote this group by $\mathcal{G}$. The statistics of the $k$ records in 
$\mathcal{G}$ are computed. Next, the $k$ records in $\mathcal{G}$ are deleted from the database $\mathcal{D}$, and 
the process is repeated iteratively until fewer than $k$ records remain in $\mathcal{D}$. We note that 
at the end of the process, it is possible that between 1 and $(k-1)$ records may 
remain. These records can be added to their nearest sub-group in the data. Thus, 
a small number of groups in the data may contain more than $k$ data points. 
The overall algorithm for the procedure of condensed group creation is denoted 
by CreateCondensedGroups, and is illustrated in Figure 1. We assume that the 
final set of group statistics is denoted by $\mathcal{H}$. This set contains the aggregate 
vector $(Sc(\mathcal{G}), Fs(\mathcal{G}), n(\mathcal{G}))$ for each condensed group $\mathcal{G}$. 
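The static procedure can be sketched in a few lines. This is an illustrative sketch rather than the authors' implementation: it assumes Euclidean distance, stores each group's statistics in a dictionary with hypothetical keys `Fs`, `Sc` and `n`, and measures "nearest sub-group" by distance to the group centroid.

```python
import random

def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def new_stats(records):
    """Aggregate statistics (Fs, Sc, n) of one condensed group."""
    d = len(records[0])
    return {
        "Fs": [sum(x[j] for x in records) for j in range(d)],
        "Sc": [[sum(x[i] * x[j] for x in records) for j in range(d)]
               for i in range(d)],
        "n": len(records),
    }

def add_record(g, x):
    """Fold one more record into an existing set of group statistics."""
    for i in range(len(x)):
        g["Fs"][i] += x[i]
        for j in range(len(x)):
            g["Sc"][i][j] += x[i] * x[j]
    g["n"] += 1

def create_condensed_groups(database, k, seed=0):
    rng = random.Random(seed)
    remaining = list(database)
    H = []
    while len(remaining) >= k:
        x = remaining.pop(rng.randrange(len(remaining)))  # random sample
        remaining.sort(key=lambda r: sq_dist(r, x))       # closest first
        H.append(new_stats([x] + remaining[:k - 1]))
        remaining = remaining[k - 1:]
    if not H:
        raise ValueError("database holds fewer than k records")
    for r in remaining:  # between 1 and k-1 leftover records
        nearest = min(H, key=lambda g: sq_dist(r, [s / g["n"] for s in g["Fs"]]))
        add_record(nearest, r)
    return H

# Hypothetical data: 7 records, indistinguishability level k = 3.
data = [(float(i), float(i % 3)) for i in range(7)]
for g in create_condensed_groups(data, 3):
    print(g["n"], g["Fs"])
```

Sorting the remaining records on every iteration keeps the sketch short; a spatial index would be the natural replacement for large databases.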



2.1 Anonymized-Data Construction from Condensation Groups 

We note that the condensation groups represent statistical information about the 
data in each group. This statistical information can be used to create anonymized 
data which has similar statistical characteristics to the original data set. This is 
achieved by using the following method: 

— A $d \times d$ covariance matrix $C(\mathcal{G})$ is constructed for each group $\mathcal{G}$. The $ij$-th 
entry of the covariance matrix is the covariance between the attributes $i$ 
and $j$ of the set of records in $\mathcal{G}$. 

— The eigenvectors of this covariance matrix are determined. These eigenvectors 
are determined by decomposing the matrix $C(\mathcal{G})$ in the following form: 

$C(\mathcal{G}) = P(\mathcal{G}) \cdot \Delta(\mathcal{G}) \cdot P(\mathcal{G})^T$    (1) 
Algorithm CreateCondensedGroups(Indistinguishability Lvl.: k, 
Database: D); 
begin 
  while D contains at least k points do 
  begin 
    Randomly sample a data point X from D; 
    G = {X}; 
    Find the closest (k - 1) records to X and add to G; 
    for each attribute j compute statistics Fs_j(G); 
    for each pair of attributes i, j compute Sc_ij(G); 
    Set n(G) = k; 
    Add the corresponding statistics of group G to H; 
    D = D - G; 
  end; 
  Assign each remaining point in D to the closest group 
  and update the corresponding group statistics; 
  return(H); 
end 



Fig. 1. Creation of Condensed Groups from the Data 



The columns of $P(\mathcal{G})$ represent the eigenvectors of the covariance matrix 
$C(\mathcal{G})$. The diagonal entries $\lambda_1(\mathcal{G}) \ldots \lambda_d(\mathcal{G})$ of $\Delta(\mathcal{G})$ represent the 
corresponding eigenvalues. Since the matrix is symmetric and positive semi-definite, the 
corresponding eigenvectors form an ortho-normal axis system. This ortho-normal 
axis-system represents the directions along which the second order correlations 
are removed. In other words, if the data were represented using this 
ortho-normal axis system, then the covariance matrix would be the diagonal 
matrix corresponding to $\Delta(\mathcal{G})$. Thus, the diagonal entries of $\Delta(\mathcal{G})$ represent 
the variances along the individual dimensions. We can assume without loss 
of generality that the eigenvalues $\lambda_1(\mathcal{G}) \ldots \lambda_d(\mathcal{G})$ are ordered in decreasing 
magnitude. The corresponding eigenvectors are denoted by $e_1(\mathcal{G}) \ldots e_d(\mathcal{G})$. 

We note that the eigenvectors together with the eigenvalues provide us with an 
idea of the distribution and the co-variances of the data. In order to re-construct 
the anonymized data for each group, we assume that the data within each group 
is independently and uniformly distributed along each eigenvector with a vari- 
ance equal to the corresponding eigenvalue. The statistical independence along 
each eigenvector is an extended approximation of the second-order statistical 
independence inherent in the eigenvector representation. This is a reasonable 
approximation when only a small spatial locality is used. Within a small spatial 
locality, we may assume that the data is uniformly distributed without substan- 
tial loss of accuracy. The smaller the size of the locality, the better the accuracy 
of this approximation. The size of the spatial locality reduces when a larger 
number of groups is used. Therefore, the use of a large number of groups leads 
to a better overall approximation in each spatial locality. On the other hand, 
the use of a larger number of groups also reduces the number of points in each 
group. While the use of a smaller spatial locality improves the accuracy of the 
approximation, the use of a smaller number of points affects the accuracy in 
the opposite direction. This is an interesting trade-off which will be explored in 
greater detail in the empirical section. 
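The regeneration step can be sketched as follows: decompose the group covariance, then draw an independent uniform coordinate along each eigenvector with range $\sqrt{12 \cdot \lambda_j}$, so that its variance equals the eigenvalue. The sketch is illustrative only. It assumes a cyclic Jacobi rotation scheme for the eigen-decomposition (the paper does not prescribe a method), and the names `jacobi_eigen` and `generate_anonymized` are hypothetical.

```python
import math
import random

def jacobi_eigen(c, sweeps=100, tol=1e-18):
    """Eigen-decomposition of a symmetric matrix by cyclic Jacobi rotations.
    Returns (eigenvalues, v); column j of v is the j-th eigenvector."""
    n = len(c)
    a = [row[:] for row in c]
    v = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                theta = 0.5 * math.atan2(2 * a[p][q], a[q][q] - a[p][p])
                co, si = math.cos(theta), math.sin(theta)
                for m in (a, v):       # rotate columns p, q of A and V
                    for r in range(n):
                        mp, mq = m[r][p], m[r][q]
                        m[r][p], m[r][q] = co * mp - si * mq, si * mp + co * mq
                for r in range(n):     # rotate rows p, q of A
                    ap, aq = a[p][r], a[q][r]
                    a[p][r], a[q][r] = co * ap - si * aq, si * ap + co * aq
    return [a[i][i] for i in range(n)], v

def generate_anonymized(mean, cov, count, rng):
    """Draw `count` records with the given mean and covariance, uniform and
    independent along each eigenvector of `cov`."""
    eigvals, v = jacobi_eigen(cov)
    d = len(mean)
    out = []
    for _ in range(count):
        # Uniform on [-a/2, a/2] with a = sqrt(12 * lambda) has variance lambda.
        u = [rng.uniform(-0.5, 0.5) * math.sqrt(12 * max(lam, 0.0))
             for lam in eigvals]
        out.append([mean[i] + sum(u[j] * v[i][j] for j in range(d))
                    for i in range(d)])
    return out

# Hypothetical group statistics: mean (1, 2), covariance [[2, 1], [1, 2]].
rng = random.Random(0)
points = generate_anonymized([1.0, 2.0], [[2.0, 1.0], [1.0, 2.0]], 5, rng)
print(points)
```

In the condensation setting, `count` would be the group size $n(\mathcal{G})$, with the mean and covariance taken from Observations 1 and 2.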



2.2 Locality Sensitivity of Condensation Process 

We note that the error of the simplifying assumption increases when a given 
group does not truly represent a small spatial locality. Since the group sizes are 
essentially fixed, the level of the corresponding inaccuracy increases in sparse re- 
gions. This is a reasonable expectation, since outlier points are inherently more 
difficult to mask from the point of view of privacy preservation. It is also im- 
portant to understand that the locality sensitivity of the condensation approach 
arises from the use of a fixed group size as opposed to the use of a fixed group 
radius. This is because fixing the group size fixes the privacy (indistinguisha- 
bility) level over the entire data set. At the same time, the level of information 
loss from the simplifying assumptions depends upon the characteristics of the 
corresponding data locality. 



3 Maintenance of Condensed Groups in a Dynamic 
Setting 

In the previous section, we discussed a static setting in which the entire data 
set was available at one time. In this section, we will discuss a dynamic setting 
in which the records are added to the groups one at a time. In such a case, it 
is a more complex problem to effectively maintain the group sizes. Therefore, 
we make a relaxation of the requirement that each group should contain k data 



Algorithm DynamicGroupMaintenance(Database: D, 
IncrementalStream: S, Distinguishability Factor: k) 
begin 
  H = CreateCondensedGroups(k, D); 
  for each data point X received from incremental stream S do 
  begin 
    Find the nearest centroid in H to X; 
    Add X to corresponding group statistics M; 
    if n(M) = 2 · k then 
    begin 
      (M1, M2) = SplitGroupStatistics(M, k); 
      Delete M from H; 
      Add M1 to H; 
      Add M2 to H; 
    end 
  end 
end 



Fig. 2. Overall Process of Maintenance of Condensed Groups 
Algorithm SplitGroupStatistics(GroupStatistics: M, GroupSize: k); 
begin 
  Determine covariance matrix C(M); 
  { The j,k-th entry of the covariance matrix is determined using the 
    formula C_jk(M) = Sc_jk(M)/n(M) - Fs_j(M) · Fs_k(M)/n(M)^2; } 
  Determine eigenvectors e_1(M) ... e_d(M) with eigenvalues λ_1(M) ... λ_d(M); 
  { Relationship is C(M) = P(M) · Δ(M) · P(M)^T 
    Here Δ(M) is a diagonal matrix; } 
  { Without loss of generality we assume that λ_1(M) ≥ ... ≥ λ_d(M); } 
  n(M1) = n(M2) = k; 
  Fs(M1) = Fs(M)/n(M) + e_1(M) · sqrt(12 · λ_1)/4; 
  Fs(M2) = Fs(M)/n(M) - e_1(M) · sqrt(12 · λ_1)/4; 
  Construct Δ(M1) and Δ(M2) by dividing diagonal entry λ_1 of Δ(M) by 4; 
  P(M1) = P(M2) = P(M); 
  C(M1) = C(M2) = P(M1) · Δ(M1) · P(M1)^T; 
  for each pair of attributes i, j do 
  begin 
    Sc_ij(M1) = k · C_ij(M1) + Fs_i(M1) · Fs_j(M1)/k; 
    Sc_ij(M2) = k · C_ij(M2) + Fs_i(M2) · Fs_j(M2)/k; 
  end; 
end 



Fig. 3. Splitting Group Statistics (Algorithm) 




Fig. 4. Splitting Group Statistics (Illustration) 



points. Rather, we impose the requirement that each group should maintain 
between $k$ and $2 \cdot k$ data points. 

As each new point in the data is received, it is added to the nearest group, 
as determined by the distance to each group centroid. As soon as the number 
of data points in the group equals $2 \cdot k$, the corresponding group needs to be 
split into two groups of $k$ points each. We note that with each group, we only 
maintain the group statistics as opposed to the actual group itself. Therefore, the 
splitting process needs to generate two new sets of group statistics as opposed 
to the data points. Let us assume that the original set of group statistics to be 
split is given by $\mathcal{M}$, and the two new sets of group statistics to be generated are 
given by $\mathcal{M}_1$ and $\mathcal{M}_2$. The overall process of group updating is illustrated by 
the algorithm DynamicGroupMaintenance in Figure 2. As in the previous case, 
it is assumed that we start off with a static database $\mathcal{D}$. In addition, we have 
a constant stream $\mathcal{S}$ of data which consists of new data points arriving in the 
database. Whenever a new data point $X$ is received, it is added to the group 
$\mathcal{M}$ whose centroid is closest to $X$. As soon as the group size equals $2 \cdot k$, the 
corresponding group statistics need to be split into two sets of group statistics. 
This is achieved by the procedure SplitGroupStatistics of Figure 3. 
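Because only sums are stored, adding a record to a group is a constant-size update of the statistics. The following is a minimal sketch of that step, assuming Euclidean distance to the centroid and a dictionary layout with hypothetical keys `Fs`, `Sc` and `n`; the sample groups are invented for illustration.

```python
def centroid(g):
    return [s / g["n"] for s in g["Fs"]]

def add_to_nearest_group(groups, x):
    """Add record x to the group with the closest centroid; return that group."""
    def dist(g):
        return sum((a - b) ** 2 for a, b in zip(centroid(g), x))
    g = min(groups, key=dist)
    for i in range(len(x)):
        g["Fs"][i] += x[i]                 # first-order sums
        for j in range(len(x)):
            g["Sc"][i][j] += x[i] * x[j]   # second-order sums
    g["n"] += 1
    return g

def needs_split(g, k):
    """True once the group has reached 2k records."""
    return g["n"] == 2 * k

# Hypothetical state: two records at the origin, two records at (10, 10).
groups = [
    {"Fs": [0.0, 0.0], "Sc": [[0.0, 0.0], [0.0, 0.0]], "n": 2},
    {"Fs": [20.0, 20.0], "Sc": [[200.0, 200.0], [200.0, 200.0]], "n": 2},
]
updated = add_to_nearest_group(groups, (9.0, 11.0))
print(updated["n"], updated["Fs"], needs_split(updated, 2))
```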

In order to split the group statistics, we make the same simplifying assump- 
tions about (locally) uniform and independent distributions along the eigenvec- 
tors for each group. We also assume that the split is performed along the most 
elongated axis direction in each case. Since the eigenvalues correspond to vari- 
ances along individual eigenvectors, the eigenvector corresponding to the largest 
eigenvalue is a candidate for a split. An example of this case is illustrated in 
Figure 4. The logic of choosing the most elongated direction for a split is to 
reduce the variance of each individual group as much as possible. This ensures 
that each group continues to correspond to a small data locality. This is useful 
in order to minimize the effects of the approximation assumptions of uniformity 
within a given data locality. We assume that the corresponding eigenvector is 
denoted by $e_1$ and its eigenvalue by $\lambda_1$. Since the variance of the data along $e_1$ 
is $\lambda_1$, the range ($a$) of the corresponding uniform distribution along $e_1$ is 
given³ by $a = \sqrt{12 \cdot \lambda_1}$. 
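The arithmetic behind this range, and behind halving the group along the leading eigenvector, can be written out as a short derivation. It is a sketch under the stated uniformity assumption, with $U[-a/2, a/2]$ denoting the uniform distribution of the projected data around the group centroid (notation introduced here for illustration):

```latex
\begin{align*}
\operatorname{Var}\bigl(U[-a/2,\,a/2]\bigr) &= \frac{a^2}{12} = \lambda_1
  \;\Longrightarrow\; a = \sqrt{12\,\lambda_1},\\
\text{centers of the two halves} &= \pm\frac{a}{4} = \pm\frac{\sqrt{12\,\lambda_1}}{4},\\
\operatorname{Var}\bigl(\text{each half, range } a/2\bigr) &= \frac{(a/2)^2}{12}
  = \frac{a^2}{48} = \frac{\lambda_1}{4}.
\end{align*}
```

Cutting the range at the centroid therefore moves each new centroid by $a/4$ along the leading eigenvector and reduces the variance in that direction by a factor of 4.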

The number of records in each newly formed group is equal to k since the 
original group of size 2 • k is split into two groups of equal size. We need to 
determine the first order and second order statistical data about each of the 
split groups $\mathcal{M}_1$ and $\mathcal{M}_2$. This is done by first deriving the centroid and zero 
(second-order) correlation directions for each group. The values of $Fs_i(\mathcal{G})$ and 
$Sc_{ij}(\mathcal{G})$ about each group can also be directly derived from these quantities. We 
will proceed to describe this derivation process in more detail. 

Let us assume that the centroid of the unsplit group $\mathcal{M}$ is denoted by $Y(\mathcal{M})$. 
This centroid can be computed from the first order values $Fs(\mathcal{M})$ using the 
following relationship: 

$Y(\mathcal{M}) = (Fs_1(\mathcal{M}), \ldots, Fs_d(\mathcal{M}))/n(\mathcal{M})$    (2) 

As evident from Figure 4, the centroids of the split groups $\mathcal{M}_1$ and $\mathcal{M}_2$ 
are given by $Y(\mathcal{M}) - (a/4) \cdot e_1$ and $Y(\mathcal{M}) + (a/4) \cdot e_1$ respectively. Therefore, the 
new centroids of the groups $\mathcal{M}_1$ and $\mathcal{M}_2$ are given by $Y(\mathcal{M}) - (\sqrt{12 \cdot \lambda_1}/4) \cdot e_1$ 
and $Y(\mathcal{M}) + (\sqrt{12 \cdot \lambda_1}/4) \cdot e_1$ respectively. It now remains to compute the second 
order statistical values. This is slightly trickier. 

³ This calculation was done by using the formula for the standard deviation of a 
uniform distribution with range $a$. The corresponding standard deviation is given by 
$a/\sqrt{12}$. 
Once the covariance matrix for each of the split groups has been computed, 
the second-order aggregate statistics can be derived by the use of the covariance 
values in conjunction with the centroids that have already been computed. Let 
us assume that the $ij$-th entry of the covariance matrix for the group $\mathcal{M}_1$ is 
given by $C_{ij}(\mathcal{M}_1)$. Then, from Observation 2, it is clear that the second order 
statistics of $\mathcal{M}_1$ may be determined as follows: 

$Sc_{ij}(\mathcal{M}_1) = k \cdot C_{ij}(\mathcal{M}_1) + Fs_i(\mathcal{M}_1) \cdot Fs_j(\mathcal{M}_1)/k$    (3) 

Since the first-order values have already been computed, the right hand side 
can be substituted once the covariance matrix has been determined. We also 
note that the eigenvectors of $\mathcal{M}_1$ and $\mathcal{M}_2$ are identical to the eigenvectors of 
$\mathcal{M}$, since the directions of zero correlation remain unchanged by the splitting 
process. Therefore, we have: 

$e_1(\mathcal{M}_1) = e_1(\mathcal{M}_2) = e_1(\mathcal{M})$ 
$e_2(\mathcal{M}_1) = e_2(\mathcal{M}_2) = e_2(\mathcal{M})$ 
$e_3(\mathcal{M}_1) = e_3(\mathcal{M}_2) = e_3(\mathcal{M})$ 
$\ldots$ 
$e_d(\mathcal{M}_1) = e_d(\mathcal{M}_2) = e_d(\mathcal{M})$ 



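The eigenvalue corresponding to $e_1(\mathcal{M})$ is equal to $\lambda_1/4$ in each split group, because the splitting 
process along $e_1$ reduces the corresponding variance by a factor of 4. All other 
eigenvalues remain unchanged. Let $P(\mathcal{M})$ represent the eigenvector matrix of 
$\mathcal{M}$, and $\Delta(\mathcal{M})$ represent the corresponding diagonal matrix. Then, the new 
diagonal matrix $\Delta(\mathcal{M}_1) = \Delta(\mathcal{M}_2)$ can be derived by dividing the entry 
$\lambda_1(\mathcal{M})$ by 4. Therefore, we have: 

$\lambda_1(\mathcal{M}_1) = \lambda_1(\mathcal{M}_2) = \lambda_1(\mathcal{M})/4$ 

The other eigenvalues of $\mathcal{M}_1$ and $\mathcal{M}_2$ remain the same: 

$\lambda_2(\mathcal{M}_1) = \lambda_2(\mathcal{M}_2) = \lambda_2(\mathcal{M})$ 
$\lambda_3(\mathcal{M}_1) = \lambda_3(\mathcal{M}_2) = \lambda_3(\mathcal{M})$ 
$\ldots$ 
$\lambda_d(\mathcal{M}_1) = \lambda_d(\mathcal{M}_2) = \lambda_d(\mathcal{M})$ 

Thus, the covariance matrices of $\mathcal{M}_1$ and $\mathcal{M}_2$ may be determined as follows: 

$C(\mathcal{M}_1) = C(\mathcal{M}_2) = P(\mathcal{M}_1) \cdot \Delta(\mathcal{M}_1) \cdot P(\mathcal{M}_1)^T$    (4) 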

Once the co-variance matrices have been determined, the second order aggre- 
gate information about the data is determined using Equation 3. We note that 
even though the covariance matrices of $\mathcal{M}_1$ and $\mathcal{M}_2$ are identical, the values 
of $Sc_{ij}(\mathcal{M}_1)$ and $Sc_{ij}(\mathcal{M}_2)$ will be different because of the different first order 
aggregates substituted in Equation 3. The overall process for splitting the group 
statistics is illustrated in Figure 3. 
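The splitting step can be sketched compactly by noting that dividing only $\lambda_1$ by 4 in the diagonal matrix is the same as subtracting $(3/4) \cdot \lambda_1 \cdot e_1 e_1^T$ from the covariance matrix, so only the leading eigenpair is needed. The sketch below makes two assumptions that are not from the paper: the leading eigenpair is found by power iteration, and the first-order sums of each half are stored as $k$ times the new centroid so that Equation 3 and Observation 2 stay consistent. Function names are hypothetical.

```python
import math

def top_eigenpair(c, iters=1000):
    """Largest eigenvalue and eigenvector of a symmetric PSD matrix by power
    iteration. The start vector (1, 2, ..., d) is an arbitrary choice; the
    method fails if it happens to be orthogonal to the leading eigenvector."""
    d = len(c)
    v = [float(i + 1) for i in range(d)]
    norm = math.sqrt(sum(t * t for t in v))
    v = [t / norm for t in v]
    for _ in range(iters):
        w = [sum(c[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(t * t for t in w))
        if norm == 0.0:
            break
        v = [t / norm for t in w]
    lam = sum(v[i] * c[i][j] * v[j] for i in range(d) for j in range(d))
    return lam, v

def split_group_statistics(fs, sc, n):
    """Split statistics (Fs, Sc, n) of a group of n = 2k records into two
    sets of statistics for k records each."""
    k = n // 2
    d = len(fs)
    y = [f / n for f in fs]                               # centroid, Eq. 2
    c = [[sc[i][j] / n - fs[i] * fs[j] / n ** 2 for j in range(d)]
         for i in range(d)]                               # Observation 2
    lam, e1 = top_eigenpair(c)
    shift = math.sqrt(12 * lam) / 4                       # a / 4
    # lambda_1 -> lambda_1 / 4 leaves every other eigenpair untouched.
    c_new = [[c[i][j] - 0.75 * lam * e1[i] * e1[j] for j in range(d)]
             for i in range(d)]
    halves = []
    for sign in (-1.0, 1.0):
        cen = [y[i] + sign * shift * e1[i] for i in range(d)]
        fs_new = [k * ci for ci in cen]                   # assumption: sums
        sc_new = [[k * c_new[i][j] + fs_new[i] * fs_new[j] / k
                   for j in range(d)] for i in range(d)]  # Eq. 3
        halves.append((fs_new, sc_new, k))
    return halves

# Hypothetical group of 4 records with mean (1, 1) and covariance diag(3, 1).
(m1, m2) = split_group_statistics([4.0, 4.0], [[16.0, 4.0], [4.0, 8.0]], 4)
print(m1)
print(m2)
```

With this convention the first-order and second-order sums of the two halves add up to those of the unsplit group, which is a useful sanity check on any implementation.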
3.1 Application of Data Mining Algorithms to Condensed Data 
Groups 

Once the condensed data groups have been generated, data mining algorithms 
can be applied to the anonymized data which is generated from these groups. 
After generation of the anonymized data, any known data mining algorithm can 
be directly applied to this new data set. Therefore, specialized data mining algo- 
rithms do not need to be developed for the condensation based approach. As an 
example, we applied the technique to the classification problem. We used a simple 
nearest neighbor classifier in order to illustrate the effectiveness of the technique. 
We also note that a nearest neighbor classifier cannot be effectively modified to 
work with the perturbation-based approach of [1]. This is because the method 
in [1] reconstructs aggregate distributions of each dimension independently. On 
the other hand, the modifications required for the case of the condensation ap- 
proach were relatively straightforward. In this case, separate sets of data were 
generated from each of the different classes. The separate sets of data for each 
class were used in conjunction with a nearest neighbor classification procedure. 
The class label of the closest record from the set of perturbed records is used for 
the classification process. 
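The classification step described above amounts to an ordinary nearest neighbor search over the regenerated records of all classes. A minimal sketch, assuming Euclidean distance; the class labels and record values below are invented for illustration.

```python
def nearest_neighbor_label(query, labeled_records):
    """labeled_records: (record, class_label) pairs from the anonymized data."""
    def sq_dist(r):
        return sum((a - b) ** 2 for a, b in zip(r, query))
    record, label = min(labeled_records, key=lambda pair: sq_dist(pair[0]))
    return label

# Hypothetical anonymized records, generated separately for each class.
anonymized = {
    "pos": [(5.1, 4.9), (4.8, 5.2)],
    "neg": [(0.2, -0.1), (-0.3, 0.4)],
}
training = [(r, label) for label, recs in anonymized.items() for r in recs]
print(nearest_neighbor_label((4.0, 4.5), training))
```

No part of the classifier depends on how the records were produced, which is the point made in the text: the anonymized data can be passed to an unmodified algorithm.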

4 Empirical Results 

Since the aim of the privacy preserving data mining process was to create a new 
perturbed data set with similar data characteristics, it is useful to compare the 
statistical characteristics of the newly created data with the original data set. 
Since the proposed technique is designed to preserve the covariance structure 
of the data, it would be interesting to test how the covariance structure of the 
newly created data set matched with the original. If the newly created data set 
has very similar data characteristics to the original data set, then the condensed 
Fig. 7. (a) Classifier Accuracy and (b) Covariance Compatibility (Pima Indian) 




Fig. 8. (a) Classifier Accuracy and (b) Covariance Compatibility (Abalone) 
data set is a good substitute for privacy preserving data mining algorithms. For 
each dimension pair $(i,j)$, let the corresponding entries in the covariance matrix 
for the original and the perturbed data be denoted by $o_{ij}$ and $p_{ij}$. In order to 
perform this comparison, we computed the statistical coefficient of correlation 
between the pairwise data entry pairs $(o_{ij}, p_{ij})$. Let us denote this value by $\mu$. 
When the two matrices are identical, the value of $\mu$ is 1. On the other hand, when 
there is perfect negative correlation between the entries, the value of $\mu$ is $-1$. 
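The compatibility measure can be sketched as a Pearson correlation over matrix entries. This is an illustrative reading of the definition: it assumes all $d^2$ entries are paired (including the diagonal and both symmetric halves), and the function name is hypothetical. The value is undefined when either matrix has all entries equal.

```python
import math

def covariance_compatibility(o, p):
    """Pearson correlation between corresponding entries of two d x d matrices."""
    xs = [v for row in o for v in row]
    ys = [v for row in p for v in row]
    m = len(xs)
    mx, my = sum(xs) / m, sum(ys) / m
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical covariance matrices of original and perturbed data.
original = [[2.0, 1.0], [1.0, 3.0]]
perturbed = [[2.1, 0.9], [0.9, 3.2]]
print(covariance_compatibility(original, perturbed))
```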

We tested the data generated from the privacy preserving condensation ap- 
proach on the classification problem. Specifically, we tested the accuracy of a 
simple k-nearest neighbor classifier with the use of different levels of privacy. 
The level of privacy is controlled by varying the sizes of the groups used for 
the condensation process. The results show that the technique is able to achieve 
high levels of privacy without noticeably compromising classification accuracy. 
In fact, in many cases, the classification accuracy improves because of the noise 
reduction effects of the condensation process. These noise reduction effects result 
from the use of the aggregate statistics of a small local cluster of points in order 
to create the anonymized data. The aggregate statistics of each cluster of points 
often mask the effects of a particular anomaly⁴ in it. This results in a more 
robust classification model. We note that the effects of anomalies in the data are 
also observed for a number of other data mining problems such as clustering [10]. 
While this paper studies classification as one example, it would be interesting to 
study other data mining problems as well. 

A number of real data sets from the UCI machine learning repository⁵ were 
used for the testing. The specific data sets used were the Ionosphere, Ecoli, 
Pima Indian, and the Abalone data sets. Except for the Abalone data set, each 
of these data sets corresponds to a classification problem. In the Abalone data 
set, the aim of the problem is to predict the age of abalone, which is a regression 
modeling problem. For this problem, the classification accuracy measure used 
was the percentage of the time that the age was predicted within an accuracy of 
less than one year by the nearest neighbor classifier. 

The results on classification accuracy for the Ionosphere, Ecoli, Pima Indian, 
and Abalone data sets are illustrated in Figures 5(a), 6(a), 7(a) and 8(a) respec- 
tively. In each of the charts, the average group size of the condensation groups 
is indicated on the X-axis. On the Y-axis, we have plotted the classification ac- 
curacy of the nearest neighbor classifier, when the condensation technique was 
used. Three sets of results have been illustrated on each graph: 

— The accuracy of the nearest neighbor classifier when static condensation was 
used. In this case, the static version of the algorithm was used in which the 
entire data set was used for condensation. 

— The accuracy of the nearest neighbor classifier when dynamic condensation 
was used. In this case, the data points were added incrementally to the 
condensed groups. 

⁴ We note that a k-nearest neighbor model is often more robust than a 1-nearest 
neighbor model for the same reason. 

⁵ http://www.ics.uci.edu/~mlearn 

— We note that when the group size was chosen to be one for the case of static 
condensation, the result was the same as that of using the classifier on the 
original data. Therefore, a horizontal line (parallel to the X-axis) is drawn in 
the graph which shows the baseline accuracy of using the original classifier. 
This horizontal line intersects the static condensation plot for a group size 
of 1. 

An interesting point to note is that when dynamic condensation is used, the 
result of using a group size of 1 does not correspond to the original data. This is 
because of the approximation assumptions implicit in the splitting algorithm of the 
dynamic condensation process. Specifically, the splitting procedure assumed a 
uniform distribution of the data within a given condensed group of data points. 
Such an approximation tends to lose its accuracy for very small group sizes. 
However, it should also be remembered that the use of small group sizes is not 
very useful anyway from the point of view of privacy preservation. Therefore, 
the behavior of the dynamic condensation technique for very small group sizes 
is not necessarily an impediment to the effective use of the algorithm. 

One of the interesting conclusions from the results of Figures 5(a), 6(a), 
7(a) and 8(a) is that the static condensation technique often provided better 
accuracy than the accuracy of a classifier on the original data set. The effects 
were particularly pronounced in the case of the ionosphere data set. As evident 
from Figure 5(a), the accuracy of the classifier on the statically condensed data 
was higher than the baseline nearest neighbor accuracy for almost all group sizes. 
The reason for this was that the process of condensation affected the data in two 
potentially contradicting ways. One effect was to add noise to the data because of 
the random generation of new data points with similar statistical characteristics. 
This resulted in a reduction of the classification accuracy. On the other hand, 
the condensation process itself removed many of the anomalies from the data. 
This had the opposite effect of improving the classification accuracy. In many 
cases, this trade-off worked in favor of improving the classification accuracy as 
opposed to worsening it. 

The use of dynamic condensation also demonstrated some interesting results. 
While the absolute classification accuracy was not quite as high with the use of 
dynamic condensation, the overall accuracy continued to be almost comparable 
to that of the original data for modestly sized groups. The comparative behavior 
of the static and dynamic condensation methods is because of the additional 
assumptions used in the splitting process of the latter. We note that the splitting 
process assumes a uniform distribution of the data within a 
particular locality (group). While this is a reasonable assumption for reasonably 
large group sizes within even larger data sets, the assumption does not work 
quite as effectively when either of the following is true: 

— When the group size is too small, then the splitting process does not estimate 
the statistical parameters of the two split groups quite as robustly. 

— When the group size is too large (or a significant fraction of the overall data 
size), then a set of points can no longer be said to represent a locality of the 
data. Therefore, the use of the uniformly distributed assumption for splitting 
and regeneration of the data points within a group is not as robust in this 
case. 

These results are reflected in the behavior of the classifier on the dynamically 
condensed data. In many of the data sets, the classification accuracy was sensitive 
to the size of the group. While the classification accuracy decreased up to 
a group size of 10, it gradually improved with increasing group size. In 
most cases, the classification accuracy of the dynamic condensation process was 
comparable to that on the original data. In some cases such as the Pima Indian 
data set, the accuracy of the dynamic condensation method was even higher 
than that of the original data set. Furthermore, the accuracy of the classifier 
on the static and dynamically condensed data was somewhat similar for modest 
group sizes between 25 and 50. One interesting result which we noticed was for 
the case of the Pima Indian data set. In this case, the classifier worked more 
effectively with the dynamic condensation technique as compared to that of 
static condensation. The reason for this was that the data set seemed to contain 
a number of classification anomalies which were removed by the splitting process 
in the dynamic condensation method. Thus, in this particular case, the splitting 
process seemed to improve the overall classification accuracy. While it is clear 
that the effects of the condensation process on classification tend to be data 
specific, it is important to note that the accuracy of the condensed data is quite 
comparable to that of the original classifier. 

We also compared the covariance characteristics of the data sets. The results 
are illustrated in Figures 5(b), 6(b), 7(b) and 8(b) respectively. It is clear that 
the value of the statistical correlation $\mu$ was almost 1 for every 
data set with the static condensation method. In most cases, the value 
of $\mu$ was larger than 0.98 over all ranges of group sizes and data sets. While the 
value of the statistical correlation reduced slightly with increasing group size, its 
relatively high value indicated that the covariance matrices of the original and 
perturbed data were virtually identical. This is a very encouraging result since it 
indicates that the approach is able to preserve the inter-attribute correlations in 
the data effectively. The results for the dynamic condensation method were also 
quite impressive, though not as accurate as the static condensation method. In 
this case, the value of $\mu$ continued to be very high (> 0.95) for two of the data 
sets. For the other two data sets, the value of $\mu$ reduced to the range of 0.65 to 
0.75 for very small group sizes. As the average group sizes increased to about 
20, this value increased to a value larger than 0.95. We note that in order for the 
indistinguishability level to be sufficiently effective, the group sizes also needed 
to be of sizes at least 15 or 20. This means that the accuracy of the classification 
process is not compromised in the range of group sizes which are most useful 
from the point of view of condensation. The behavior of the correlation statistic 
for dynamic condensation of small group sizes is because of the splitting process. 
It is a considerable approximation to split a small number of discrete 
points using a uniform distribution assumption. As the group sizes increase, the 
value of $\mu$ increases because of the robustness of using a larger number of points 
in each group. However, increasing group sizes beyond a certain limit has the 
opposite effect of reducing $\mu$ (slightly). This effect is visible in both the static 
and dynamic condensation methods. The second effect is because of the greater 
levels of approximation inherent in using a uniform distribution assumption over 
a larger spatial locality. We note that when the overall data set size is large, it is 
more effectively possible to simultaneously achieve the seemingly contradictory 
goals of using the robustness of larger group sizes as well as the effectiveness 
of using a small locality of the data. This is because a modest group size of 30 
truly represents a small data locality in a large data set of 10000 points, whereas 
this cannot be achieved in a data set containing only 100 points. We note that 
many of the data sets tested in this paper contained fewer than 1000 data points. 
These constitute difficult cases for our approach. Yet, the condensation approach 
continued to perform effectively both for small data sets such as the Ionosphere 
data set, and for larger data sets such as the Pima Indian data set. In addition, 
the condensed data often provided more accurate results than the original data 
because of removal of anomalies from the data. 
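
As a rough illustration of the statistic discussed above, the following sketch computes a correlation between the entries of two covariance matrices. It assumes that $\mu$ is the Pearson correlation coefficient taken over corresponding entries of the covariance matrices of the original and perturbed data; the function names and the toy records are hypothetical and are not taken from the experiments.

```python
def covariance_matrix(rows):
    """Sample covariance matrix of equal-length numeric records."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [
        [sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
         for j in range(d)]
        for i in range(d)
    ]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def covariance_correlation(original, perturbed):
    """Assumed definition of mu: correlation over matching covariance entries."""
    flat_a = [v for row in covariance_matrix(original) for v in row]
    flat_b = [v for row in covariance_matrix(perturbed) for v in row]
    return pearson(flat_a, flat_b)


# Hypothetical toy records with three attributes each.
original = [(1.0, 2.0, 0.5), (2.0, 4.5, 0.1), (3.0, 5.0, 0.9), (4.0, 9.0, 0.3)]
# Stand-in for condensed data: every record is shifted slightly.
perturbed = [
    tuple(v + (0.01 if k % 2 == 0 else -0.01) for v in row)
    for k, row in enumerate(original)
]
mu = covariance_correlation(original, perturbed)
print(f"mu = {mu:.4f}")
```

A value of mu close to 1 indicates that the two covariance matrices are nearly proportional entry by entry, which is the sense in which the text calls them virtually identical.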



5 Conclusions and Summary 

In this paper, we presented a new approach for privacy-preserving data mining of data 
sets. Since the method re-generates multi-dimensional data records, existing data 
mining algorithms do not need to be modified to be used with the condensation 
technique. This is a clear advantage over techniques such as the perturbation 
method discussed in [1] in which a new data mining algorithm needs to be 
developed for each problem. Unlike other methods which perturb each dimension 
separately, this technique is designed to preserve the inter-attribute correlations 
of the data. As substantiated by the empirical tests, the condensation technique 
is able to preserve the inter-attribute correlations of the data quite effectively. At 
the same time, we illustrated the effectiveness of the system on the classification 
problem. In many cases, the condensed data provided a higher classification 
accuracy than the original data because of the removal of anomalies from the 
database. 


Abstract. XML suffers from the major limitation of high redundancy. Although 
compression can be beneficial for XML data, once compressed, the data 
can seldom be browsed and queried efficiently. To address this problem, 
we propose XQueC, an [XQue]ry processor and [C]ompressor, which covers 
a large set of XQuery queries in the compressed domain. We shred compressed 
XML into suitable data structures, aiming at both reducing memory usage at query 
time and querying data while compressed. XQueC is the first system to take ad- 
vantage of a query workload to choose the compression algorithms, and to group 
the compressed data granules according to their common properties. By means of 
experiments, we show that good trade-offs between compression ratio and query 
capability can be achieved in several real cases, such as those covered by an XML 
benchmark. On average, XQueC improves over previous query-aware XML compression 
systems, while remaining reasonably close to general-purpose query-unaware 
XML compressors. Finally, query execution times (QETs) for a wide variety of queries 
show that XQueC can reach speeds comparable to XQuery engines on uncompressed data. 



1 Introduction 

XML documents have an inherent textual nature due to repeated tags and to PCDATA 
content. Therefore, they lend themselves naturally to compression. Once the compressed 
documents are produced, however, one would like to still query them in compressed 
form as much as possible (reminiscent of “lazy decompression” in relational 
databases [1], [2]). The advantages of processing queries in the compressed domain are 
several: first, in a traditional query setting, access to small chunks of data may lead to 
fewer disk I/Os and reduce the query processing time; second, the memory and computation 
effort in processing compressed data can be dramatically lower than for 
uncompressed data, so that even low-battery mobile devices can afford it; third, the 
possibility of obtaining compressed query results saves network bandwidth 
when sending these results to a remote location, in the spirit of [3]. 

Previous systems, namely XGrind [4] and XPRESS [5], have been proposed recently, 
allowing the evaluation of simple path expressions in the compressed domain. However, 
these systems are based on a naive top-down query evaluation mechanism, which is not 
enough to execute queries efficiently. Most of all, they are not able to execute a large 
set of common XML queries (such as joins, inequality predicates, aggregates, nested 
queries, etc.) without spending a prohibitive amount of time decompressing intermediate results. 

In this paper, we address the problem of compressing XML data in such a way 
as to allow efficient XQuery evaluation in the compressed domain. We can assert that 
our system, XQueC, is the first XQuery processor on compressed data. It is the first 
system to achieve a good trade-off among data compression factors, queryability and 
XQuery expressibility. To that purpose, we have carefully chosen a fragmentation and 
storage model for the compressed XML documents, providing selective access paths to 
the XML data, and thus further reducing the memory needed in order to process a query. 
The XQueC system has been demonstrated at VLDB 2003 [6]. 

The basis of our fragmentation strategy is borrowed from the XMill [7] project. 
XMill is a very efficient compressor for XML data; however, it was not designed to 
allow querying the documents in their compressed form. XMill made the important 
observation that data nodes (leaves of the XML tree) found on the same path in an 
XML document (e.g. /site/people/person/address/city in the XMark [8] documents) often 
exhibit similar content. Therefore, it makes sense to group all such values into a single 
container and choose the compression strategy once per container. Subsequently, XMill 
treated a container like a single “chunk of data” and compressed it as such, which 
disables access to any individual data node, unless the whole container is decompressed. 
Separately, XMill compressed and stored the structure tree of the XML document. 

While in XMill a container may contain leaf nodes found under several paths, leaving 
to the user or the application the task of defining these containers, in XQueC the frag- 
mentation is always dictated by the paths, i.e., we use one container per root-to-leaf path 
expression. When compressing the values in the container, like XMill, we take advantage 
of the commonalities between all container values. But most importantly, unlike XMill, 
each container value is individually compressed and individually accessible, enabling 
effective query processing. 

We base our work on the principle that XML compression (for saving disk space) 
and sophisticated query processing techniques (like complex physical operators, indexes, 
query optimization, etc.) can be used together when properly combined. This principle has 
been stated and forcefully validated in the domain of relational query processing [1], [3]. 
It is no less important in the realm of XML. 

In our work, we focus on the right compression of the values found in an XML docu- 
ment, coupled with a compact storage model for all parts of the document. Compressing 
the structure of an XML document has two facets. First, XML tags and attribute names 
are extremely repetitive, and practically all systems (indeed, even those not claiming 
to do “compression”) encode such tags by means of much more compact tag numbers. 
Second, existing work [9] has addressed the summarization of the tree structure itself, 
which connects parent and child nodes. While structure compression is interesting, 
its advantages are not very visible when considering the XML document as a 
whole. Indeed, for a rich corpus of XML datasets, both real and synthetic, our measurements 
have shown that values make up 70% to 80% of the document. Projects like 
XGrind [4] and XPRESS [5] have already proposed schemes for value compression 
that would enable querying, but they suffer from limited query evaluation techniques 
(see also Section 1.2). These systems apply a fixed compression strategy regardless of 
the data and query set. In contrast, our system increases the compression benefits by 
adapting its compression strategy to the data and query workload, based on a suitable 
cost model. 

By doing data fragmentation and compression, XQueC indirectly targets the problem 
of main-memory XQuery evaluation, which has recently attracted the attention of the 
community [9], [10]. In [10], the authors show that some current XQuery prototypes 
are in practice limited by their large memory consumption; due to its small footprint, 
XQueC scales better (see Section 5). Furthermore, some such in-memory prototypes 
exhibit prohibitive query execution times even for simple lookup queries. [9] focuses 
on the problem of fitting into memory a narrowed version of the tree of tags, which is 
however a small percentage of the overall document, as explained above. 

XQueC addresses this problem in a two-fold way. First, in order to diminish its 
footprint, it applies powerful compression to the XML documents. The compression 
algorithms that we use allow most predicates to be evaluated directly on the compressed 
values. Thus, decompression is often necessary only at the end of the query evaluation 
(see Section 4). Second, the XQueC storage model includes lightweight access support 
structures for the data itself, thus providing efficient primitives for query evaluation. 



1.1 The XQueC System 

The system we propose compresses XML data and queries it as much as possible 
in its compressed form, covering all real-life, complex classes of queries. 

The XQueC system adheres to the following principles: 

1. As in XMill, data is collected into containers, and the document structure stored 
separately. In XQueC, there is a container for each different <type, pe>, where pe 
is a distinguished root-to-leaf path expression and type is a distinguished elementary 
type. The set of containers is then partitioned again to allow for better sharing of 
compression structures, as explained in Section 2.2. 

2. In contrast with previous compression-aware XML querying systems, whose storage 
was plainly based on files, XQueC is the first to use a complete and robust storage 
model for compressed XML data, including a set of access support structures. Such 
storage is fundamental to guarantee a fast query evaluation mechanism. 

3. XQueC seamlessly extends a simple algebra for evaluating XML queries to include 
compression and decompression. This algebra is exploited by a cost-based optimizer, 
which may choose query evaluation strategies that freely mix regular operators and 
compression-aware ones. 

4. XQueC is the first system to exploit the query workload to (i) partition the containers 
into sets according to the source model¹ and to (ii) properly assign the most suitable 
compression algorithm to each set. We have devised an appropriate cost model, 
which helps in making the right choices. 

¹ The source model is the model used for the encoding, for instance the Huffman encoding tree 
for Huffman compression [11] and the dictionary for ALM compression [12], outlined later. 

5. XQueC is the first compressed XML querying system to use order-preserving² 
textual compression. Among several alternatives, we have chosen to use the 
ALM [12] compression algorithm, which provides good compression ratios and still 
allows fast decompression, which is crucial for an algorithm to be used in a database 
setting [13]. This feature enables XQueC to evaluate, in the compressed domain, 
the class of queries involving inequality comparisons, which are not featured by the 
other compression-aware systems. 
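
The order-preserving property mentioned in principle 5 can be illustrated with a deliberately simple code. The sketch below is not ALM: it merely replaces each character by its rank in the sorted alphabet of the container, which is enough to show that inequality predicates can then be evaluated on encoded values alone. The names build_rank_table and comp, and the sample city values, are hypothetical.

```python
def build_rank_table(values):
    """Map each character occurring in the container to its rank."""
    alphabet = sorted(set("".join(values)))
    return {ch: i for i, ch in enumerate(alphabet)}


def comp(value, rank):
    """Toy order-preserving encoding: a tuple of character ranks."""
    return tuple(rank[ch] for ch in value)


# Hypothetical values of one container, e.g. city names.
cities = ["Rome", "Paris", "Athens", "Roma", "Par"]
rank = build_rank_table(cities)

# Order preservation: comp(x1) < comp(x2) iff x1 < x2.
for x1 in cities:
    for x2 in cities:
        assert (comp(x1, rank) < comp(x2, rank)) == (x1 < x2)

# An inequality selection evaluated without decoding any stored value:
# only the query constant "Roma" is encoded.
bound = comp("Roma", rank)
selected = sorted(c for c in cities if comp(c, rank) < bound)
print(selected)
```

Tuples compare lexicographically in Python, so the rank tuples inherit the order of the strings. A real order-preserving compressor such as ALM must achieve the same property while also shrinking the data.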

In the following sections, we will use XMark [8] documents for describing XQueC. 
A simplified structural outline of these documents is depicted in Figure 1 (at right). 
Each document describes an auction site, with people and open auctions (dashed lines 
represent IDREFs pointing to IDs and plain lines connect the other XML items). We 
describe XQueC following its architecture, depicted in Figure 1 (at left). It contains the 
following modules: 

1. The loader and compressor converts XML documents into a compressed, yet 
queryable format. A cost analysis leverages the variety of compression algorithms 
and the query workload predicates to decide the partition of the containers. 

2. The compressed repository stores the compressed documents and provides: (i) compressed 
data access methods, and (ii) a set of compression-specific utilities that 
enable, e.g., the comparison of two compressed values. 

3. The query processor evaluates XQuery queries over compressed documents. Its 
complete set of physical operators (regular ones and compression-aware ones) allows 
for efficient evaluation over the compressed repository. 

1.2 Related Work 

XML data compression was first addressed by XMill [7], following the principles out- 
lined in the previous section. After coalescing all values of a given container into a single 
data chunk, XMill compresses each container separately with its most suitable algorithm, 
and then again with gzip to shrink it as much as possible. However, an XMill-compressed 
document is opaque to a query processor: thus, one must fully decompress a whole chunk 
of data before being able to query it. 

The XGrind system [4] aims at query-enabling XML compression. XGrind does not 
separate data from structure: an XGrind-compressed XML document is still an XML 
document, whose tags have been dictionary-encoded, and whose data nodes have been 
compressed using the Huffman [11] algorithm and left at their place in the document. 
XGrind’s query processor can be considered an extended SAX parser, which can han- 
dle exact-match and prefix-match queries on compressed values and partial-match and 
range queries on decompressed values. However, several operations are not supported 
by XGrind, for example, non-equality selections in the compressed domain. Therefore, 
XGrind cannot perform any join, aggregation, nested queries, or construct operations. 
Such operations occur in many XML query scenarios, as illustrated by XML benchmarks 
(e.g., all but the first two of the 20 queries in XMark [8]). 

² Note that a compression algorithm comp preserves order if, for any $x_1$, $x_2$, $comp(x_1) < 
comp(x_2)$ iff $x_1 < x_2$. 

Fig. 1. Architecture of the XQueC prototype (left); simplified summary of the XMark XML 
documents (right). 

Also, XGrind uses a fixed naive top-down navigation strategy, which is clearly insufficient 
to provide for interesting alternative evaluation strategies, as was done in 
existing works on querying compressed relational data (e.g., [1], [2]). These works 
considered evaluating arbitrary SQL queries on compressed data, by comparing (in the 
traditional framework of cost-based optimization) many query evaluation alternatives, 
including compression / decompression at several possible points. 

A third recent work, XPRESS [5], uses a novel reverse arithmetic encoding method, 
mapping entire path expressions to intervals. Also, XPRESS uses a simple mechanism 
to infer the type (and therefore the compression method suited) of each elementary data 
item. XPRESS’s compression method, like XGrind’s, is homomorphic, i.e. it preserves 
the document structure. 

To summarize, while XML compression has received significant attention [4], [5], 
[7], querying compressed XML is still in its infancy [4], [5]. Current XML compression 
and querying systems do not come anywhere near to efficiently executing complex 
XQuery queries. Indeed, even the evaluation of XPath queries is slowed down by the 
use of the fixed top-down query evaluation strategy. 

Moreover, the interest towards compression even in a traditional data warehouse 
setting is constantly increasing in commercial systems, such as Oracle [14]. In [14], 
it is shown that the occupancy of raw data can be reduced while not impacting query 
performance. In principle, we expect that in the future a big share of this data will be 
expressed in XML, thus making the problem of compression very appealing. 

Finally, as far as information retrieval systems are concerned, [15] exploits a variant 
of Huffman (extended to “bytes” instead of bits) in order to execute phrase matching 
entirely in the compressed domain. However, querying the text is obviously only a subset 
of the XQuery features. In particular, theta-joins are not feasible with the above variant 
of Huffman, whereas they can be executed by means of order-aware ALM. 

1.3 Organization 

The paper is organized as follows. In Section 2, we motivate the choice of our storage 
structures for compressed XML, and present ALM [12] and other compression algo- 
rithms, which we use for compressing the containers. Section 3 outlines the cost model 
used for partitioning the containers into sets, and for identifying the right compression 
to be applied to the values in each container set. Section 4 describes the XQueC query 
processor, its set of physical operators, and outlines its optimization algorithm. Section 5 
shows the performance measures of our system on several data sets and XQuery queries. 



2 Compressing XML Documents in a Queryable Format 

In this section, we present the principles behind our approach for storing compressed 
XML documents, and the resulting storage model. 

2.1 Compression Principles 

In general, we make the observation that within XML text, strings represent a large per- 
centage of the document, while numbers are less frequent. Thus, compression of strings, 
when effective, can truly reduce the occupancy of XML documents. Nevertheless, not all 
compression algorithms can seamlessly afford string comparisons in the compressed do- 
main. In our system, we include both order-preserving and order-agnostic compression 
algorithms, and the final choice is entrusted to a suitable cost model. 

Our approach for compressing XML was guided by the following principles. 

Order-agnostic compression. As an order-agnostic algorithm, we chose classical Huffman³, 
which is widely known as a simple algorithm that achieves minimum 
redundancy among the resulting codes. The process of encoding and decoding is also 
faster than with universal compression techniques. Finally, it has a set of fixed codewords, 
thus strings compressed with Huffman can be compared in the compressed domain 
within equality predicates. However, evaluating inequality predicates requires decompression. 
That is why XQueC may exploit order-preserving compression as well as 
order-agnostic compression. 
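
The claim that Huffman-compressed strings can be compared for equality (and for prefix matches) without decompression follows from the fixed, prefix-free codewords. The sketch below illustrates this with a textbook character-level Huffman coder; the helper names and the sample values are hypothetical and do not reflect XQueC's actual implementation.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_codes(text):
    """Build a character-level Huffman code table for the given text."""
    freq = Counter(text)
    if len(freq) == 1:  # degenerate case: a single distinct symbol
        return {next(iter(freq)): "0"}
    tiebreak = count()  # keeps heap entries comparable when weights tie
    heap = [(f, next(tiebreak), {sym: ""}) for sym, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]


def encode(value, codes):
    """Concatenate the fixed codewords of the characters of value."""
    return "".join(codes[ch] for ch in value)


# Hypothetical container values; one source model is built for all of them.
values = ["gold", "golf", "silver", "go"]
codes = huffman_codes("".join(values))
stored = [encode(v, codes) for v in values]

# Equality predicate: compare bit strings, never decode the stored values.
equal_hits = [v for v, bits in zip(values, stored) if bits == encode("gold", codes)]

# Prefix-wildcard predicate ("go*"): the codeword sequence of the prefix
# must be a bit-prefix of the stored encoding.
prefix_hits = [v for v, bits in zip(values, stored)
               if bits.startswith(encode("go", codes))]
print(equal_hits, prefix_hits)
```

Because codeword order is unrelated to character order, comparing two encodings with < says nothing about the order of the underlying strings, which is why inequality predicates require decompression under this scheme.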

Order-preserving compression. Whereas the merits of Huffman are well known, 
the choice of an order-preserving algorithm is not immediate. We had initially 
three choices for encoding strings in an order-preserving manner: the Arithmetic [16], 
Hu-Tucker [17] and ALM [12] algorithms. We knew that dictionary-based encoding has 
demonstrated its effectiveness w.r.t. other non-dictionary approaches [18], while ALM 
has outperformed Hu-Tucker (as described in [19]). ALM, being both dictionary-based 
and efficient, was a good choice for our system. ALM has been used in relational 
databases for blank-padding (i.e. in Oracle) and for index compression. Due to its 
dictionary-based nature, ALM decompresses faster than Huffman, since it outputs bigger 
portions of a string at a time when decompressing. Moreover, ALM seamlessly 
solved the problem of order-preserving dictionary compression, raised by encodings 
such as Zilch encoding, string prefix compression and composite key compression, by 
improving each of these. To this purpose, ALM eliminates the prefix property exhibited 
by those former encodings by allowing in the dictionary more than one symbol for the 
same prefix. 

³ Here and in the remainder of the paper, by Huffman we shall mean solely the classical Huffman 
algorithm [11], thus disregarding its variants. 

Fig. 2. An example of encoding in ALM. 

We now provide a short overview of how the ALM algorithm works. The basic idea 
is to take the original set of source substrings, 
split it into a set of disjoint partitioning intervals, and associate an interval prefix with 
each partitioning interval. For example, Figure 2 shows the mapping from the original 
source (made of the strings there, their, these) into some partitioning intervals 
and associated prefixes, which clearly do not scramble the original order among the 
source strings. We have implemented our own version of the algorithm, and we have 
obtained encouraging results w.r.t. previous compression-aware XML processors (see 
Section 5). 

Workload-based choices of compression. Among the possible predicates writable in 
an XQuery query, we distinguish among inequality, equality and wildcard predicates. The ALM 
algorithm [12] allows inequality and equality predicates in the compressed domain, but 
not wildcards, whereas Huffman [11] supports prefix-wildcards and equality but not 
inequality. Thus, the choice of the algorithm can be aided by a proper query workload, 
whenever one is available. If no workload has been provided, 
XQueC uses ALM for strings and decompresses the compared values in case of wildcard 
operations. 
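
The capabilities listed above can be summarized as a small lookup table. The following sketch is a simplification introduced here for illustration only: it picks the algorithm that evaluates the most workload predicate kinds in the compressed domain, whereas XQueC relies on a cost model (Section 3). The names CAPABILITIES and choose_algorithm are hypothetical.

```python
# Predicate kinds each algorithm can evaluate without decompression,
# as stated in the text above.
CAPABILITIES = {
    "ALM": {"equality", "inequality"},
    "Huffman": {"equality", "prefix-wildcard"},
}


def choose_algorithm(workload_predicates=None):
    """Pick a string compression algorithm for one container.

    Without a workload, fall back to ALM, as the text describes.
    Otherwise prefer the algorithm covering the most predicate kinds;
    ties go to ALM because it is listed first (an assumption of this
    sketch, not a rule stated in the paper).
    """
    if not workload_predicates:
        return "ALM"
    needed = set(workload_predicates)
    return max(CAPABILITIES, key=lambda name: len(needed & CAPABILITIES[name]))


print(choose_algorithm())
print(choose_algorithm(["inequality", "equality"]))
print(choose_algorithm(["prefix-wildcard"]))
```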

Structures for algebraic evaluation. Containers in XQueC closely resemble B+trees 
on values. Moreover, a light-weight structure summary allows for accessing the structure 
tree and the data containers in the query evaluation process. Data fragmentation allows 
for better exploiting all the possible evaluation plans, i.e. bottom-up, top-down, hybrid or 
index-based. As shown below, several queries of the XMark benchmark take advantage 
of XQueC's structures and of the consequent flexibility in parsing and 
querying these compressed structures. 

2.2 Compressed Storage Structures 

The XQueC loader/compressor parses and splits an XML document into the data struc- 
tures depicted in Figure 1. 

Fig. 3. Storage structures in the XQueC repository 

Node name dictionary. We use a dictionary to encode the element and attribute names 
present in an XML document. Thus, if there are $N_t$ distinct names, we assign to each of 
them a bit string of length $\lceil \log_2(N_t) \rceil$. For example, the XMark documents use 92 distinct 
names, which we encode on 7 bits. 
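
The arithmetic behind the 7-bit figure is a base-2 logarithm rounded up to a whole number of bits. The sketch below builds a fixed-width name dictionary under that reading; the function names and the sample tag list are hypothetical.

```python
def code_width(n_names):
    """Bits needed for a fixed-width code over n_names distinct names."""
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 2, computed exactly.
    return max(1, (n_names - 1).bit_length())


def build_name_dictionary(names):
    """Assign each distinct element or attribute name a fixed-width bit string."""
    distinct = sorted(set(names))
    width = code_width(len(distinct))
    return {name: format(i, f"0{width}b") for i, name in enumerate(distinct)}


print(code_width(92))  # the XMark case discussed above

# Hypothetical tag names, with repetitions as they would occur in a document.
tags = ["site", "people", "person", "name", "person", "name", "address", "city"]
dictionary = build_name_dictionary(tags)
print(dictionary)
```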

Structure tree. We assign to each non-value XML node (element or attribute) a unique 
integer ID. The structure tree is stored as a sequence of node records, where each record 
contains: its own ID; the corresponding tag code; the IDs of its children; and (redundantly) 
the ID of its parent. For better query performance, as an access support structure, 
we construct and store a B+ search tree on top of the sequence of node records. Finally, 
each node record points to all its attribute and text children in their respective containers. 

Value containers. All data values found under the same root-to-leaf path expression in 
the document are stored together into homogeneous containers. A container is a sequence 
of container records, each one consisting of a compressed value and a pointer to the parent 
of this value in the structure tree. Records are not placed in the document order, but 
in a lexicographic order, to enable fast binary search. Note that container generation 
as done in XQueC is reminiscent of vertical partitioning of relational databases [20]. 
This kind of partitioning guarantees random access to the document content at different 
points, i.e. the containers access points. This choice provides interesting query evaluation 
strategies and leads to good query performance (see Section 5). Moreover, containers, 
even if kept separated, may share the same source model, or they can be compressed 
with different algorithms if not involved in the same queries. This is decided by a cost 
analysis which exploits the query workload and the similarities among containers, as 
described in Section 3. 
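
A value container as described above can be sketched as a sorted sequence of (value, parent ID) records searched by bisection. In this minimal sketch the stored values are plain strings standing in for compressed values; this presumes a compression scheme under which the sorted order of the stored form matches the lexicographic order of the values. The class and method names are hypothetical.

```python
from bisect import bisect_left, bisect_right


class Container:
    """Records for one root-to-leaf path, kept in lexicographic value order."""

    def __init__(self, path, records):
        # records: iterable of (stored_value, parent_node_id)
        self.path = path
        self.records = sorted(records)
        self.keys = [value for value, _ in self.records]

    def parents_equal(self, key):
        """Parent IDs of all records whose value equals key (binary search)."""
        lo = bisect_left(self.keys, key)
        hi = bisect_right(self.keys, key)
        return [pid for _, pid in self.records[lo:hi]]

    def parents_less_than(self, key):
        """Parent IDs of all records whose value is strictly below key."""
        return [pid for _, pid in self.records[:bisect_left(self.keys, key)]]


# Hypothetical container for the city path mentioned in the Introduction.
cities = Container(
    "/site/people/person/address/city",
    [("Rome", 7), ("Athens", 3), ("Paris", 5), ("Athens", 9)],
)
print(cities.parents_equal("Athens"))
print(cities.parents_less_than("Paris"))
```

Both lookups touch only the records they return plus a logarithmic number of probes, which is the benefit of storing records in value order rather than document order.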

Structure summary. The loader also constructs, as a redundant access support structure, 
a structural summary representing all possible paths in the document. For tree-structured 
XML documents, it will always have fewer nodes than the document (typically by several 
orders of magnitude). A structural summary of the auction documents can be derived 
from Figure 1, by (i) omitting the dashed edges, which brings it to a tree form, and (ii) 
storing in each non-leaf node in Figure 3, accessible in this tree by a path p, the list 
of nodes reachable in the document instance by the same path. Finally, the leaf nodes 
of our structure summary point to the corresponding value containers. Note that the 
structure summary is very small, thus it does not affect the compression rate. Indeed, in 
our experiments on the corpus of XML documents described in Section 5, the structure 
summary amounts to about 19% of the original document size. 
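
The construction of such a summary can be sketched as a single traversal that groups node IDs by their root-to-node path. The example below uses Python's standard xml.etree.ElementTree parser and assigns IDs in document order; the function name and the tiny sample document are hypothetical, and attributes and text nodes are ignored for brevity.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_summary(xml_text):
    """Map each distinct root-to-node path to the IDs of the nodes on it."""
    root = ET.fromstring(xml_text)
    summary = defaultdict(list)
    next_id = 0
    stack = [(root, "/" + root.tag)]
    while stack:
        node, path = stack.pop()
        summary[path].append(next_id)  # IDs follow document (pre)order
        next_id += 1
        # Push children in reverse so that the first child is visited first.
        for child in reversed(list(node)):
            stack.append((child, path + "/" + child.tag))
    return dict(summary)


doc = (
    "<site><people>"
    "<person><name>A</name></person>"
    "<person><name>B</name></person>"
    "</people></site>"
)
summary = build_summary(doc)
for path, ids in summary.items():
    print(path, ids)
```

The six element nodes of the sample document collapse into four summary nodes; the gap widens quickly on documents with many repeated elements.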

Other indexes and statistics. When loading a document, other indexes and/or statistics 
can be created, either on the value containers, or on the structure tree. Our loader prototype 
currently gathers simple fan-out and cardinality statistics (e.g. number of person 
elements). 

To measure the occupancy of our structures, we have used a set of documents produced 
by means of the xmlgen generator of the XMark project, ranging from 115 KB 
to 46 MB. They have been reduced by an average factor of 60% after compression (these 
figures include all the above access structures). 

Our proposed storage structure is the simplest and most compact one that fulfills the 
principles listed at the beginning of Section 2; there are many ways to store XML in 
general [21]. If we omit our access support structures (backward edges, B+ index, and 
the structure summary), we shrink the database by a factor of 3 to 4, albeit at the price 
of deteriorated query performance. 

Any storage mechanism for XML can be seamlessly adopted in XQueC, as long as 
it allows the presence of containers and the facilities to access container items. 

2.3 Memory Issues 

Data fragmentation in XQueC guarantees a wide variety of query evaluation strategies, 
and not solely top-down evaluation as in homomorphic compressors [4], [5]. Instead 
of identifying at compile-time the parts of the documents necessary for query evalua- 
tion, as given by an XQuery projection operator [10], in XQueC the path expressions 
are hard-coded into the containers and projection is already prepared in advance when 
compressing the document, without any additional effort for the loader. Consider as 
an example the following query, Q14 of XMark: 

    FOR $i IN document("auction.xml")/site//item 
    WHERE CONTAINS($i/description,"gold") 
    RETURN $i/name/text() 

This query would require prohibitive parsing times in XGrind and XPRESS, which 
basically have to load the whole document into main memory and parse it entirely in order 
to find the sought items. For this query, as shown in Figure 4, the whole XML stream has to 
be parsed to find the <item> elements. 

In XQueC, the compressor has already shredded the data, and the accessibility of these 
data from the structure summary saves the parsing and loading times. Thus, in 
XQueC the structure summary is parsed (not the whole structure tree), then the involved 
containers are directly accessed (or alternatively their selected single items) and loaded 
into main memory. More precisely, as shown in Figure 4, once the structure summary 
leads to the containers $C_1$, $C_2$ and $C_3$, only these (or part of them) need to be fetched 
into memory. Finally, note that in Galax, extended with the projection operator [10], the 
execution times for queries involving the descendant-or-self axis (such as XMark Q14) 
are significantly increased, since additional complex computation is demanded of the 
loader for those queries. 
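
How a structure summary can lead directly to the containers involved in a path with a descendant-or-self step may be sketched as a match over the summary's root-to-leaf paths. The following is an illustration only: the translation of // into a regular expression, the function names, and the sample container paths are assumptions made here and are not taken from XQueC.

```python
import re


def pattern_to_regex(pattern):
    """Translate a path with '//' steps into a regex over summary paths."""
    # Each '//' may stand for any number (including zero) of intermediate steps.
    parts = pattern.split("//")
    return re.compile("(?:/[^/]+)*/".join(re.escape(part) for part in parts))


def matching_containers(container_paths, pattern):
    """Select the container paths matched by the pattern."""
    regex = pattern_to_regex(pattern)
    return [path for path in container_paths if regex.fullmatch(path)]


# Hypothetical root-to-leaf paths taken from a structure summary.
paths = [
    "/site/regions/africa/item/name",
    "/site/regions/asia/item/name",
    "/site/regions/africa/item/description/text",
    "/site/people/person/name",
    "/site/people/person/address/city",
]

name_containers = matching_containers(paths, "/site//item/name")
description_containers = matching_containers(paths, "/site//item/description//text")
print(name_containers)
print(description_containers)
```

Only the selected containers would then need to be fetched; the person containers are never touched for a query such as Q14.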

3 Compression Choices 

XQueC exploits the query workload to choose the way containers are compressed. As 
already highlighted, the containers are filled up with textual data, which represents a big 
share of the whole documents. Thus, achieving a good trade-off between compression 
ratio and query execution times necessarily implies the capability to make a good 
choice for textual container compression. 

Fig. 4. Accesses to containers in case of XMark’s Q14 with descendant-or-self axis in 
XPRESS/XGrind versus XQueC. 

First, a container may be compressed with any compression algorithm, but obviously 
one would like to apply a compression algorithm with nice properties. For instance, the 
decompression time for a given algorithm strongly influences the times of queries over 
data compressed with that algorithm. In addition, the compression ratio achieved by a 
given algorithm on a given container influences the overall compression ratio. 

Second, a container can be compressed separately or can share the same source model 
with other containers. The latter choice would be very convenient whenever for example 
two containers exhibit data similarities, which will improve their common compression 
ratio. Moreover, the occupancy of the source model is as relevant in the choice of the 
algorithm as the occupancy of containers. 

To understand the impact of compression choices, consider two binary-encoded 
containers, $ct_1$ and $ct_2$. $ct_1$ contains only strings composed of letters a and b, whereas 
$ct_2$ contains only strings composed of letters c and d. Suppose, as one extreme case, that 
two separate source models are built for the two containers; in such a case, containers 
are encoded with 1 bit per letter. As the other extreme case, a common source model is 
used for both containers, thus requiring 2 bits per letter for the encoding, and increasing 
the containers' occupancy. This scenario may get even more complicated when we think 
of an arbitrary number of encodings assigned to each container. This small example 
already shows that compressing several containers with the same source model can lead to 
losses in the compression ratio. 
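
The two extreme cases above can be checked with a few lines of arithmetic. The sketch assumes a fixed-length binary code over the alphabet of each source model and ignores the space taken by the source models themselves; the function names and container contents are hypothetical.

```python
def bits_per_letter(*containers):
    """Bits per letter of a fixed-length code over the shared alphabet."""
    alphabet = {ch for ct in containers for value in ct for ch in value}
    return max(1, (len(alphabet) - 1).bit_length())


def encoded_bits(containers, bits):
    """Total size in bits of all values at the given bits per letter."""
    return bits * sum(len(value) for ct in containers for value in ct)


ct1 = ["abba", "ab", "baab"]  # only letters a and b
ct2 = ["cd", "dcc"]           # only letters c and d

# One source model per container: 1 bit per letter in each.
separate = (encoded_bits([ct1], bits_per_letter(ct1))
            + encoded_bits([ct2], bits_per_letter(ct2)))

# One common source model: the alphabet has 4 letters, so 2 bits per letter.
shared = encoded_bits([ct1, ct2], bits_per_letter(ct1, ct2))

print(separate, shared)
```

With these contents the shared model doubles the encoded size; when two containers have overlapping alphabets the penalty shrinks, and sharing also avoids storing a second source model.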

In the sequel, we show how our system addresses these problems, by proposing a 
suitable cost model, a greedy algorithm for making the right choice, and some experimental 
results. The cost model of XQueC is based on the set of non-numerical (textual) 
containers, the set of available compression algorithms $\mathcal{A}$, and the query workload $\mathcal{W}$. 
As it is typical of optimization problems, we will characterize the search space, define 
the cost function, and finally propose a simple search strategy. 

3.1 Search Space: Possible Compression Configurations 

Let $C$ be the set of containers built from a set of documents $\mathcal{D}$. A compression 
configuration $s$ for $\mathcal{D}$ is denoted by a tuple $\langle P, alg \rangle$, where $P$ is a partition of $C$'s 
elements, and the function $alg : P \to A$ associates a compression algorithm with each set $p$ 
in the partition $P$. The configuration $s$ thus dictates that all values of the containers in $p$ 
will be compressed using $alg(p)$ and a single common source model. Moreover, let $\mathcal{P}$ be 
the set of possible partitions of $C$. The cardinality of $\mathcal{P}$ is the Bell number $B_{|C|}$, which 
is exponential in $|C|$. For each possible partition $P_i \in \mathcal{P}$, there are $|A|^{|P_i|}$ ways of 
assigning a compression algorithm to each set in $P_i$. Therefore, the size of the search 
space is $\sum_{i=1}^{B_{|C|}} |A|^{|P_i|}$, which is exponential in $|A|$. 
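To get a feeling for how quickly this search space grows, the sum can be evaluated by grouping partitions by their number of sets: the number of partitions of n containers into k sets is the Stirling number of the second kind S(n, k), so the sum equals the sum over k of S(n, k) times the number of algorithms raised to the power k. The sketch below uses this standard identity; the function names are ours and not part of XQueC.

```python
def stirling2_row(n):
    """Return [S(n, 0), ..., S(n, n)], the Stirling numbers of the second kind."""
    row = [1]  # S(0, 0) = 1
    for m in range(1, n + 1):
        new = [0] * (m + 1)
        for k in range(1, m + 1):
            same_count = row[k] if k < len(row) else 0
            # Either the m-th element joins one of k existing sets, or opens a new one.
            new[k] = k * same_count + row[k - 1]
        row = new
    return row

def bell_number(n):
    """Number of partitions of a set of n containers."""
    return sum(stirling2_row(n))

def search_space_size(num_containers, num_algorithms):
    """Number of (partition, algorithm assignment) pairs."""
    return sum(count * num_algorithms ** k
               for k, count in enumerate(stirling2_row(num_containers)))

for n in (3, 5, 10):
    print(n, bell_number(n), search_space_size(n, 2))
```

Even with only two algorithms, the count grows far too fast for exhaustive enumeration, which motivates the greedy strategy of Section 3.3.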



3.2 Cost Function: Appreciating the Quality of a Compression Configuration 

Intuitively, the cost function for a configuration s reflects the time needed to apply 
the necessary data decompressions in order to evaluate the predicates involved in the 
queries of $W$. Reasonably, it also accounts for the compression ratios of the employed 
compression algorithms, and it includes the cost of storing the source model structures. 
The cost of a configuration s is an integer value computed as a weighted sum of storage 
and decompression costs. 



Characterization of compression algorithms. Each algorithm $a \in A$ is denoted by a 
tuple $\langle d_c, c_s(F), c_a(F), eq, ineq, wild \rangle$. The decompression cost $d_c$ is an estimate of 
the cost of decompressing a container record by using $a$, the storage cost $c_s(F)$ is a 
function estimating the cost of storing a container record compressed with $a$, and the storage 
cost of the source model structures $c_a(F)$ is a function estimating the cost of storing 
the source model structures for a container record. $F$ is a symmetric similarity matrix 
whose generic element $F[i, j]$ is a real number ranging between 0 and 1, capturing the 
normalized similarity between a container $ct_i$ and a container $ct_j$. $F$ is built on the basis 
of data statistics, such as the number of overlapping values, the character distribution 
within the container entries, and possibly other type information, whenever available 
(e.g. the XSchema types, using results presented in [22]; see footnote 4). Finally, the algorithmic 
properties eq, ineq and wild are boolean values indicating whether the algorithm supports 
in the compressed domain: (i) equality predicates without prefix-matching (eq), (ii) 
inequality predicates without prefix-matching (ineq) and (iii) equality predicates with 
prefix-matching (wild). For instance, Huffman will have eq = true, ineq = false and 
wild = true, while ALM will have eq = true, ineq = true and wild = false. We 
denote each parameter of algorithm $a$ with an array notation, e.g., $a[eq]$. 
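One possible way to represent this characterization in code is a small record type with dictionary-style access mirroring the array notation. The sketch below is illustrative only: the class name, the numeric decompression costs, and the omission of the two storage cost functions are our simplifications; only the boolean properties of Huffman and ALM come from the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CompressionAlgorithm:
    name: str
    d_c: float   # per-record decompression cost (placeholder value, not measured)
    eq: bool     # equality without prefix-matching in the compressed domain
    ineq: bool   # inequality without prefix-matching in the compressed domain
    wild: bool   # equality with prefix-matching in the compressed domain

    def __getitem__(self, prop):
        """Array-style access, e.g. a["eq"]."""
        return getattr(self, prop)

# Boolean properties as stated in the text; the d_c values are invented.
HUFFMAN = CompressionAlgorithm("Huffman", d_c=1.0, eq=True, ineq=False, wild=True)
ALM = CompressionAlgorithm("ALM", d_c=1.5, eq=True, ineq=True, wild=False)

def supporting(algorithms, prop):
    """Algorithms able to evaluate the given predicate kind without decompression."""
    return [a for a in algorithms if a[prop]]

print([a.name for a in supporting([HUFFMAN, ALM], "ineq")])
```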

Storage costs. The container and source model storage costs are simply computed as 
$\sum_{p \in P} \left( alg(p)[c(F_p)] \cdot \sum_{ct \in p} |ct| \right)$, where $c = c_s$ for the case of container storage and 
$c = c_a$ for source model storage (see footnote 5). Obviously, $c_s$ and $c_a$ need not be evaluated on 
the overall $F$ but solely on $F_p$, that is, the projection of $F$ over the containers of the 
partition $p$. 
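The storage cost can be sketched as a loop over the sets of the partition. In the sketch below, the similarity matrix is a dictionary of dictionaries, each cost function is a callable taking the projected matrix and returning a per-record cost, and the constant cost functions are invented placeholders; none of these representation choices is prescribed by the cost model itself.

```python
def project(F, members):
    """Restrict the similarity matrix F to the containers in `members`."""
    return {i: {j: F[i][j] for j in members} for i in members}

def storage_cost(partition, alg, sizes, F, which):
    """Sum over the sets p of: cost function of alg(p) on F_p, times total size of p.

    `which` selects "c_s" (container storage) or "c_a" (source model storage).
    """
    total = 0.0
    for p in partition:
        cost_fn = alg[p][which]
        total += cost_fn(project(F, p)) * sum(sizes[ct] for ct in p)
    return total

# Toy configuration: {A, B} share a source model, C has its own.
p1, p2 = frozenset({"A", "B"}), frozenset({"C"})
sizes = {"A": 10, "B": 20, "C": 5}            # number of records, made up
F = {x: {y: 1.0 if x == y else 0.5 for y in sizes} for x in sizes}
alg = {
    p1: {"c_s": lambda Fp: 0.5, "c_a": lambda Fp: 0.25},   # placeholder estimators
    p2: {"c_s": lambda Fp: 0.4, "c_a": lambda Fp: 0.5},
}

print(storage_cost([p1, p2], alg, sizes, F, "c_s"))
```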

4 We do not delve here into the details of F as study of similarity among data is outside the scope 
of this paper. 

5 We are not considering here the containers that are not involved in any query in W . Those do 
not incur a cost so they can be disregarded in the cost model. 
Decompression cost. In order to evaluate the decompression cost associated with a given 
compression configuration $s$, we define three square matrices, $E$, $I$ and $D$, of size 
$(|C| + 1) \times (|C| + 1)$. These matrices reflect the comparisons (equality, inequality and 
prefix-matching equality comparisons, respectively) made in $W$ among container values 
or between container values and constants. More formally, the generic element $E[i, j]$, with 
$i \neq |C| + 1$ and $j \neq |C| + 1$, is the number of equality predicates in $W$ between $ct_i$ 
and $ct_j$ not involving prefix-matching, whereas with $i = |C| + 1$ or $j = |C| + 1$, it is 
the number of equality predicates in $W$ between $ct_i$ and a constant (if $j = |C| + 1$), or 
between $ct_j$ and a constant (if $i = |C| + 1$), not involving prefix-matching. Matrices $I$ 
and $D$ have the same structure but refer to inequality and prefix-matching comparisons, 
respectively. Obviously, $E$, $I$ and $D$ are symmetric. 

Considering the generic element of the three matrices, say $M[i, j]$, the associated 
decompression cost is obviously zero if $ct_i$ and $ct_j$ share the same source model and 
the algorithm they are compressed with supports the corresponding predicate in the 
compressed domain. A decompression cost occurs in three cases: (i) $ct_i$ and $ct_j$ are 
compressed using different algorithms; (ii) $ct_i$ and $ct_j$ are compressed using the same 
algorithm but different source models; (iii) $ct_i$ and $ct_j$ share the same source model 
but the algorithm does not support the needed comparison (equality in the case of $E$, 
inequality for $I$ and prefix-matching for $D$) in the compressed domain. For instance, for 
a generic element $I[i, j]$, in the case of $i \neq j$, $i \neq |C| + 1$ and $j \neq |C| + 1$, the cost 
would be: 

- zero, if $ct_i \in p$, $ct_j \in p$, $alg(p)[ineq] = true$; 

- $|ct_i| \cdot alg(p')[d_c] + |ct_j| \cdot alg(p'')[d_c]$, if $ct_i \in p'$, $ct_j \in p''$, $p' \neq p''$ (cases (i) and (ii)); 

- $(|ct_i| + |ct_j|) \cdot alg(p)[d_c]$, if $ct_i \in p$, $ct_j \in p$, $alg(p)[ineq] = false$ (case (iii)). 

The decompression cost is calculated by summing up the costs associated with each 
element of the matrices $E$, $I$, and $D$. However, note that (i) for the cases of $E$ and $D$, 
we consider $alg(p)[eq]$ and $alg(p)[wild]$, respectively, and that (ii) the term referring 
to the cardinality of the containers to be decompressed is adjusted in the cases of 
self-comparisons (i.e. $i = j$) and comparisons with constants ($i = |C| + 1$ or $j = |C| + 1$). 
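The three-way case distinction for an inequality element between two distinct containers translates directly into code. The sketch below covers only that case (no self-comparisons, no constants) and does not reproduce the adjustments mentioned for those; the dictionary-based representation of the configuration and all numbers are assumptions made for illustration.

```python
def inequality_pair_cost(cti, ctj, block_of, alg, sizes):
    """Decompression cost attached to an inequality comparison between cti and ctj.

    block_of maps a container to the partition set holding it,
    alg maps a partition set to its algorithm's parameters,
    sizes maps a container to its number of records.
    """
    p1, p2 = block_of[cti], block_of[ctj]
    if p1 != p2:
        # Cases (i) and (ii): different sets, hence different source models.
        return sizes[cti] * alg[p1]["d_c"] + sizes[ctj] * alg[p2]["d_c"]
    if alg[p1]["ineq"]:
        # Same source model and the algorithm supports inequality: nothing to decompress.
        return 0
    # Case (iii): same source model, but inequality is not supported.
    return (sizes[cti] + sizes[ctj]) * alg[p1]["d_c"]

sizes = {"ct1": 10, "ct2": 20}
alg = {"p1": {"d_c": 2, "ineq": True}, "p2": {"d_c": 3, "ineq": False}}

print(inequality_pair_cost("ct1", "ct2", {"ct1": "p1", "ct2": "p2"}, alg, sizes))
print(inequality_pair_cost("ct1", "ct2", {"ct1": "p1", "ct2": "p1"}, alg, sizes))
print(inequality_pair_cost("ct1", "ct2", {"ct1": "p2", "ct2": "p2"}, alg, sizes))
```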



3.3 Devising a Suitable Search Strategy 

XQueC currently uses a greedy strategy for moving into the search space. The search 
starts with an initial configuration $s_0 = \langle P_0, alg_0 \rangle$, where $P_0$ is a partition of $C$ having 
sets of exactly one container, and $alg_0$ blindly assigns to each set a generic compression 
algorithm (e.g. bzip) and a separate source model. Next, $s_0$ is gradually improved by a 
sequence of configuration moves. 

Let Pred be the set of value comparison predicates appearing in $W$. A move from 
$s_k = \langle P_k, alg_k \rangle$ to $s_{k+1} = \langle P_{k+1}, alg_{k+1} \rangle$ is done by first randomly extracting 
a predicate pred from Pred. Let $ct_i$ and $ct_j$ be the containers involved in pred (for 
instance, pred makes an equality comparison, such as $ct_i = ct_j$, or an inequality one, 
such as $ct_i > ct_j$). Let $p'$ and $p''$ be the sets in $P_k$ to which $ct_i$ and $ct_j$ belong, respectively. If 
$p' = p''$, we build a new configuration $s'$ where $alg_{k+1}(p')$ is such that the evaluation of 
pred is enabled on compressed values, and $alg_{k+1}$ has the greatest number of algorithmic 
properties holding true. Then, we evaluate the costs of $s_k$ and $s'$, and let $s_{k+1}$ be the one 
with the minimum cost. In the case of $p' \neq p''$, we build two new configurations $s'$ and 
$s''$. $s'$ is obtained by dropping $ct_i$ and $ct_j$ from $p'$ and $p''$, respectively, and adding the set 
$\{ct_i, ct_j\}$ to $P_{k+1}$. $s''$ is obtained by replacing $p'$ and $p''$ with their union. For both $s'$ and 
$s''$, $alg_{k+1}$ associates to the new sets in $P_{k+1}$ an algorithm enabling the evaluation of 
pred in the compressed domain and having the greatest number of algorithmic properties 
holding true. Finally, we evaluate the costs of $s_k$, $s'$ and $s''$, and let $s_{k+1}$ be the one 
with the minimum cost. 
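The partition side of a configuration move can be sketched as follows. The code only builds the candidate partitions and picks an algorithm for the new sets; evaluating the cost of each candidate and keeping the cheapest is left out. Representing a partition as a set of frozensets and an algorithm as a dictionary of its boolean properties are our assumptions, as are the function names.

```python
def candidate_partitions(partition, cti, ctj):
    """Candidate partitions for a predicate comparing containers cti and ctj."""
    p1 = next(p for p in partition if cti in p)
    p2 = next(p for p in partition if ctj in p)
    if p1 == p2:
        # Same set already: only the algorithm choice can change.
        return [set(partition)]
    rest = partition - {p1, p2}
    # s': pull cti and ctj out of their sets into a new set {cti, ctj}.
    split = set(rest) | {frozenset({cti, ctj})}
    for leftover in (p1 - {cti}, p2 - {ctj}):
        if leftover:
            split.add(leftover)
    # s'': replace the two sets with their union.
    merged = set(rest) | {p1 | p2}
    return [split, merged]

def pick_algorithm(algorithms, needed):
    """Among algorithms supporting `needed`, take one with the most properties true.

    Raises ValueError if no algorithm supports the predicate kind.
    """
    usable = [a for a in algorithms if a[needed]]
    return max(usable, key=lambda a: a["eq"] + a["ineq"] + a["wild"])

partition = {frozenset({"A", "B"}), frozenset({"C", "D"})}
split, merged = candidate_partitions(partition, "A", "C")
huffman = {"name": "Huffman", "eq": True, "ineq": False, "wild": True}
alm = {"name": "ALM", "eq": True, "ineq": True, "wild": False}
print(split, merged, pick_algorithm([huffman, alm], "ineq")["name"])
```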

Example. To give a flavor of the savings gained by partitioning the set of containers, 
consider an initial configuration with five containers on an XMark document, all 
of them sized about 6MB, which we initially (naively) choose to compress with ALM 
only; let us call this configuration NaiveConf. The workload is made of XQuery queries 
with inequality predicates over the path expressions leading to the above containers. 
The first three containers are filled with Shakespeare's sentences, the fourth is filled 
with person names and the fifth with dates. Using the above workload, we obtain the 
best partitioning, which has three partitions: one with the first three containers and a 
distinct partition each for the fourth and the fifth. Let us call it GoodConf. The compression factor 
shifts from 56.14% for NaiveConf to 67.14%, 71.75% and 65.15%, respectively, for 
the three partitions of GoodConf. While in such a case the source model sizes do not 
vary significantly, the decompression cost in GoodConf is clearly advantageous w.r.t. 
NaiveConf, with a gain of 21.42% for the Shakespearean text and 28.57% for person names, 
and a loss of only 6% for dates. □ 

Note that, for each predicate in Pred, the strategy explores a fixed subset of possible 
configuration moves, so its complexity is linear in $|Pred|$. Of course, due to this partial 
exploration, the search yields a locally optimal solution. Moreover, containers not 
involved in $W$ are not considered by the cost model, and a reasonable choice could be to 
compress them using order-unaware algorithms offering good compression ratios, e.g. 
bzip2 [23]. Finally, note also that the choice of a suitable compression configuration is 
orthogonal to the choice of an optimal XML storage model [22]; we can 
combine both for an automatic storage-and-compression design. 



4 Evaluating XML Queries over Compressed Data 



The XQueC query processor consists of a query parser, an optimizer, and a query 
evaluation engine. The set of physical operators used by the query evaluation engine can be 
divided into three classes: 

- data access operators , retrieving information from the compressed storage struc- 
tures; 

- regular data combination operators (joins, outer joins, selections etc.); 

- compression and decompression operators. 







[Figure 5: plan tree, layout lost in extraction. Operators appearing in the plan: 
XMLSerialize; decompress(person name, item name); TextContent(name); TextContent(item); 
Child(item -> name); Child(person -> name); LeftOuterJoin(@id = @person), implemented 
as a merge join; two MergeJoin operators; Child(closed_auction -> buyer); 
Parent(item_ref -> closed_auction); and ContainerScan operators over 
"/site/people/person/@id", "/site/closed_auctions/closed_auction/buyer/@person", 
"/site/regions/europe/item/@id" and "/site/closed_auctions/closed_auction/item_ref/@item".] 

Fig. 5. Query execution plan for XMark's Q9. 



Among our data access operators, there are ContScan and ContAccess, which allow, 
respectively, scanning all (elementID, compressed value) pairs from a container, and 
accessing only some of them, according to an interval search criterion. 
StructureSummaryAccess provides direct access to the identifiers of all elements reachable through a given 
path. Parent and Child fetch, respectively, the parent and the children (all children, 
or just those with a specific tag) of a given set of elements. Finally, TextContent pairs 
element IDs with all their immediate text children, retrieved from their respective 
containers. TextContent is implemented as a hash join pairing the element IDs with the 
content obtained from a ContScan. 
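A hash join of this kind can be sketched in a few lines. The sketch assumes that the set of element IDs is the build side and the container scan is the probe side, which the text does not specify; the function name and the sample data are likewise invented.

```python
def text_content(element_ids, cont_scan):
    """Pair the requested element IDs with the (compressed) values of a container.

    cont_scan is any iterable of (element_id, compressed_value) pairs,
    standing in for the output of a ContScan operator.
    """
    wanted = set(element_ids)            # build side: hash the element IDs
    for elem_id, value in cont_scan:     # probe side: one pass over the container
        if elem_id in wanted:
            yield elem_id, value

# Made-up container content; values stay compressed (opaque bytes) throughout.
container = [(1, b"\x01"), (2, b"\x02"), (3, b"\x03")]
print(list(text_content([2, 3], container)))
```

Note that in this sketch the output follows the order of the container scan, not the order of the requested IDs.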

Due to the storage model chosen in XQueC (Section 2.2), the StructureSummaryAccess 
operator provides the identifiers of the required elements in the correct document 
order. Furthermore, the Parent and Child operators preserve the order of the elements 
to which they are applied. Also, if the Child operator returns more than one 
child for a given node, these children are returned in the correct order. This order-preserving 
behavior allows us to perform many path computations through comparatively 
inexpensive 1-pass merge joins; furthermore, many simple queries can be answered without 
requiring a sort to re-constitute document order. 
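The benefit of order preservation can be seen in a textbook merge join, which needs no sorting or hashing when both inputs already arrive ordered on the join key. The following is a generic sketch, not XQueC's operator; the tuple layout and the sample keys are assumptions.

```python
def merge_join(left, right):
    """Join two lists of (key, payload) pairs that are both sorted by key.

    Yields (key, left_payload, right_payload) for every matching pair.
    """
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Delimit the run of equal keys on the right side.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Pair every left tuple having this key with the whole run.
            while i < len(left) and left[i][0] == lk:
                for r in range(j, j_end):
                    yield lk, left[i][1], right[r][1]
                i += 1
            j = j_end

# Example: parent element IDs joined with (parent ID, child) pairs, both in ID order.
parents = [(1, "a"), (2, "b"), (4, "d")]
children = [(2, "x"), (2, "y"), (3, "z"), (4, "w")]
print(list(merge_join(parents, children)))
```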

While these operators respect document order, ContScan and ContAccess respect 
data order, providing fast access to elements (and values) according to a given value 
search criterion. Also, as soon as predicates on container values are given in the query, it 
is often profitable to start query evaluation by scanning (and perhaps merge-joining) a 
few containers. 
As an example of QEP, consider query Q9 from XMark: 

FOR $p IN document("auction.xml")/site/people/person 
LET $a := 
    FOR $t IN document("auction.xml")/site/ 
                  closed_auctions/closed_auction 
    LET $n := 
        FOR $t2 IN document("auction.xml")/site/ 
                       regions/europe/item 
        WHERE $t/itemref/@item = $t2/@id 
        RETURN $t2 
    WHERE $p/@id = $t/buyer/@person 
    RETURN <item> $n/name/text() </item> 
RETURN <person name=$p/name/text()> $a </person> 

Figure 5 shows a possible XQueC execution plan for Q9 (this is indeed the plan used 
in the experiments). Based on this example, we make several remarks. First, note that we 
only decompress the necessary pieces of information (person name and item name), and only 
at the very end of the query execution (the decompress operators shown in bold font). All 
the way down in the QEP, we were able to compute the three-way join between persons, 
buyers, and items directly on the compressed attributes person/@id, buyer/@person, 
and item_ref/@item. Second, due to the order of data obtained from ContainerScans, 
we are able to make extensive use of MergeJoins, without the need for sorting. Third, 
this plan mixes Parent and Child operators, alternating judiciously between top-down 
and bottom-up strategies, in order to minimize the number of tuples manipulated at any 
particular moment. This feature is made possible by the usage of a full set of algebraic 
evaluation choices, which XQueC has, but which is not available to the XGrind or XPress query 
processors. 

Finally, note that in query Q9, for instance, an XMLSerialize operator is also employed 
in order to correctly construct the new XML which the query outputs. In this respect, we 
recall that XML construction plays a minor role within the XML algebraic evaluation, 
and, not being crucial, it can be disregarded in the whole query execution time [24]. This 
has been confirmed by our experiments. 

5 Implementation and Experimental Evaluation 

XQueC is being implemented entirely in Java, using as back-end an embedded database, 
Berkeley DB [25]. We have performed comparative measurements showing 
that XQueC is competitive with both query-aware compressors and early XQuery 
prototypes. 

In the following, we want to illustrate both XQueC's good compression ratios and its 
query execution times. To this purpose, we have done two kinds of experiments: 

Compression Factors. We have performed experiments on both synthetic data (XMark 
documents) and on real-life data sets (in particular, we considered the ones chosen 
in [5] for the purpose of cross-comparison with it). 

Query Execution Times. We show how our system performs on some XML benchmark 
queries [8] and cross-compare them with the query execution times of optimized 
Galax [10], an open-source XQuery prototype. 

All the experiments have been executed on a DELL Latitude C820 laptop equipped 
with a 2.20GHz CPU and 512MB RAM. 
Table 1. Data sets used in the experiments (XMark11 is used in the QET measurements). 

Document           Size (MB)  Containers  Distinct tags  Tree nodes 
Shakespeare        15.0       39          22             65621 
Baseball           16.8       41          46             27181 
Washington-course  12.1       12          18             99729 
XMark11            11.3       432         77             76726 




Fig. 6. Average CF for Shakespeare, WashingtonCourse and Baseball data sets (left); and for 
XMark synthetic data sets (right). 



Compression Factors. We have compared the obtained compression factors (defined as 
$1 - (cs/os)$, where $cs$ and $os$ are the sizes of the compressed and original documents, 
respectively) with the corresponding factors of XMill, XGrind and XPRESS. Figure 6 (left) 
shows the average compression factor obtained for a corpus of documents composed 
of Shakespeare.xml, Washington-Course.xml and Baseball.xml, whose main 
characteristics are shown in Table 1. Note that, on average, XQueC closely tracks XPRESS. It is 
interesting to notice that some limitations affect some of the XML compressors that we 
tested; for example, the documents decompressed by XPRESS have lost all their white 
spaces. Thus, the XQueC compression factor could be further improved if blanks were 
not considered. 
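For reference, the compression factor defined above can be computed as follows. The sizes used in the example are invented and do not correspond to any measurement reported here.

```python
def compression_factor(compressed_size, original_size):
    """Compression factor 1 - cs/os; higher means better compression."""
    return 1 - compressed_size / original_size

def average_compression_factor(size_pairs):
    """Mean compression factor over (compressed_size, original_size) pairs."""
    factors = [compression_factor(cs, os) for cs, os in size_pairs]
    return sum(factors) / len(factors)

# Hypothetical sizes in MB: a 100MB document compressed down to 25MB.
print(compression_factor(25, 100))
```

A factor of 0.75 therefore means that the compressed document occupies a quarter of the original size.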

Moreover, we have also tested the compression factors on XMark synthetic data sets 
of different sizes (we considered documents ranging from 1MB to 25MB), generated 
by means of xmlgen [8]. As Figure 6 (right) shows, we have again obtained good 
compression factors w.r.t. XPRESS and XMill. 

Note also that XGrind does not appear in these experiments. Indeed, due to repeated 
crashes, we were not able to load into the XGrind system (the version available through 
the site http://sourceforge.net ) any XMark document except for one sized 100KB, whose 
compression factor however is very low and not representative of the system (precisely 
equal to 17.36%). 



Query Execution Times. We have tested our system against the optimized version of 
Galax by running XMark queries and other queries. Due to space limits, we select here a 
set of significant XMark queries. Indeed, the XMark queries left out stress language features 
on which compression will likely have no significant impact, e.g., support 
for functions, deep nesting etc. We chose Galax because it is open-source 
and has an optimizer. Note that the XQueC optimizer is not finalized yet (and was indeed 
not used in these measurements), thus our results are only due to the compression ratio, data 
structures, and efficient execution engine. 

Fig. 7. Comparative execution times between us and Optimized Galax. 

Figure 7 shows the executions of XQueC queries on the document XMark11, sized 
11.3MB. For the sake of better readability, in Figure 7 we have omitted Q9 and Q8. 
These queries took 2.133 sec. and 2.142 sec. respectively in our system, whereas 
in Galax Q9 could not be measured on our machine (see footnote 6) and Q8 took 126.33 sec. Note 
also that on Q2, Q3 and Q16 the QET is a little worse than that of Galax, because our 
data model imposes a large number of parent-child joins and the current implementation 
uses simple unique IDs. However, even with this limitation, we are still 
reasonably close to Galax, and we expect much better results once XQueC migrates to 
3-valued IDs, as already started in the spirit of [26], [27], [28]. Most importantly, note 
that the XQueC QETs above are to be understood as the times taken both to execute the 
queries in the compressed domain and to decompress the obtained results. Thus, those measures 
show that there is no performance penalty in XQueC w.r.t. Galax due to compression: 
with times comparable to those of an XQuery engine over uncompressed data, XQueC 
exhibits the advantage of compression. 

As a general remark, note that our system is implemented in Java and can be made 
faster by using a native code compiler, which we also plan to plug in in the immediate 
future. 

Finally, it is worth noting that a comparison of XQueC with XGrind and XPress query 
times could not be done because fully working versions of the latter are 
not publicly available. Nevertheless, that comparison would have been less meaningful, 
since those systems cover a limited fragment of XPath, and not full XQuery, as discussed 
in Section 1.2. 



6 The same query has been tested on a more powerful machine in the paper [10] and results in a 
rather lengthy computation. 
6 Conclusions and Future Work 

We have presented XQueC, a compression-aware XQuery processor. We have shown that 
our system exhibits a good trade-off between compression factors over different XML 
data sets and query evaluation times on XMark queries. XQueC works on compressed 
XML documents, which can be a huge advantage when query results must be shipped 
around a network. 

In the very near future, our system will be improved in several ways: by moving to 
three-valued IDs for XML elements, in the spirit of [26], [27], [28], and by incorporating 
further storage techniques that additionally reduce the occupancy of structures. 
The implementation of an XQuery [29] optimizer for querying compressed XML data 
is ongoing. Moreover, we are testing the suitability of our system w.r.t. the full-text 
queries [30], which are being defined for the XQuery language at W3C. Another 
important extension we have devised is needed for loading into our system documents 
larger than those currently supported (e.g. SwissProt, measuring about 500MB). To this purpose, we plan to 
access the containers during the parsing phase directly on secondary storage rather than 
in memory. 
Abstract. XML makes data flexible in representation and easily 
portable on the Web but it also substantially inflates data size as a 
consequence of using tags to describe data. Although many effective 
XML compressors, such as XMill, have been recently proposed to solve 
this data inflation problem, they do not address the problem of running 
queries on compressed XML data. More recently, some compressors have 
been proposed to query compressed XML data. However, the compres- 
sion ratio of these compressors is usually worse than that of XMill and 
that of the generic compressor gzip, while their query performance and 
the expressive power of the query language they support are inadequate. 
In this paper, we propose XQzip, an XML compressor which supports 
querying compressed XML data by imposing an indexing structure, 
which we call Structure Index Tree (SIT), on XML data. XQzip addresses 
both the compression and query performance problems of existing XML 
compressors. We evaluate XQzip’s performance extensively on a wide 
spectrum of benchmark XML data sources. On average, XQzip is able 
to achieve a compression ratio 16.7% better and a querying time 12.84 
times less than another known queriable XML compressor. In addition, 
XQzip supports a wide scope of XPath queries such as multiple, deeply 
nested predicates and aggregation. 



1 Introduction 

XML has become the de facto standard for data exchange. However, its flexibility 
and portability are gained at the cost of substantially inflated data, which is a 
consequence of using repeated tags to describe data. This hinders the use of 
XML in both data exchange and data archiving. In recent years, many XML 
compressors have been proposed to solve this data inflation problem. There are 
two types of compressions: unqueriable compression and queriable compression. 

The unqueriable compression, such as XMill [8], makes use of the similarities 
between the semantically related XML data to eliminate data redundancy so 
that a good compression ratio is always guaranteed. However, in this approach 
the compressed data is not directly usable; a full chunk of data must be first 
decompressed in order to process the imposed queries. 



E. Bertino et al. (Eds.): EDBT 2004, LNCS 2992, pp. 219-236, 2004. 
(c) Springer-Verlag Berlin Heidelberg 2004 
<site> 
  <open_auctions> 
    <open_auction id="open1"> 
      <initial>$12.00</initial> 
      <bid> 
        <date>12/02/2000</date> 
        <increase>$2.00</increase> 
      </bid> 
      <bid> 
        <date>12/03/2000</date> 
        <increase>$1.50</increase> 
      </bid> 
      <seller person="person71"/> 
    </open_auction> 
    <open_auction id="open2"> 
      <initial>$500.00</initial> 
      <seller person="person8"/> 
    </open_auction> 
    <open_auction id="open3"> 
      <initial>$1.50</initial> 
      <bid> 
        <date>11/29/2002</date> 
        <increase>$0.50</increase> 
      </bid> 
      <seller person="person15"/> 
    </open_auction> 
    <open_auction id="open4"> 
      <initial>$100.00</initial> 
      <seller person="person11"/> 
    </open_auction> 
    <open_auction id="open5"> 
      <initial>$8.50</initial> 
      <bid> 
        <date>08/20/2002</date> 
        <increase>$5.00</increase> 
      </bid> 
      <seller person="person7"/> 
    </open_auction> 
  </open_auctions> 
</site> 



Fig. 1. A Sample Auction XML Extract 




Fig. 2. Structure Tree (contents of the exts not shown) of the Auction XML Extract 




Fig. 3. SIT of the Auction Structure Tree 



The queriable compression encodes each of the XML data items individually 
so that the compressed data item can be accessed directly without a full de- 
compression of the entire file. However, the fine-granularity of the individually 
compressed data unit does not take advantage of the XML data commonalities 
and, hence, the compression ratio is usually much degraded with respect to the 
full-chunked compression strategy used in unqueriable compression. 

The queriable compressors, such as XGrind [14] and XPRESS [10], adopt a 
homomorphic transformation to preserve the structure of the XML data so that 
queries can be evaluated on the structure. However, the preserved structure is 
always too large (linear in the size of the XML document). It is very 
inefficient to search this large structure space, even for simple path queries. For 
example, to search for bidding items with an initial price under $10 in the 
compressed file of the sample XML extract shown in Fig. 1, XGrind parses the entire 
compressed XML document and, for each encoded element/attribute parsed, it 
has to match its incoming path with the path of the input query. XPRESS makes 
an improvement as it reduces the element-by-element matching to path-by-path 
matching by encoding a path as a distinct interval in [0.0, 1.0), so that a path can 
be matched using the containment relationships among the intervals. However, 
the path-by-path matching is still inefficient since most paths are duplicated in 
an XML document, especially in data-centric XML documents. 
Contributions. We propose XQzip, which has the following desirable features: 
(1) achieves a good compression ratio and a good compression/decompression 
time; (2) supports efficient query processing on compressed XML data; and (3) 
supports an expressive query language. XQzip provides feasible solutions to the 
problems encountered with the queriable and unqueriable compressions. 

Firstly, XQzip removes the duplicate structures in an XML document to 
improve query performance by using an indexing structure called the Structure 
Index Tree (or SIT). An example of a SIT is shown in Fig. 3, which is the index 
of the tree in Fig. 2, the structure of the sample XML extract in Fig. 1. 
Note that the duplicate structures in Fig. 2 are eliminated in the SIT. In fact, 
large portions of the structure of most XML documents are redundant and can 
be eliminated. For example, if an XML document contains 1000 repetitions of 
our sample XML extract (with different data contents), the corresponding tree 
structure will be 1000 times bigger than the tree in Fig. 2. However, its SIT will 
essentially have the same structure as the one in Fig. 3, implying that the search 
space for query evaluation is reduced 1000 times by the index. 

Secondly, XQzip avoids full decompression by compressing the data into a 
sequence of blocks which can be decompressed individually and at the same 
time allow commonalities of the XML data to be exploited to achieve a good 
compression. XQzip also effectively reduces the decompression overhead in query 
evaluation by managing a buffer pool for the decompressed blocks of XML data. 

Thirdly, XQzip utilizes the index to query the compressed XML data. XQzip 
supports a large portion of XPath [15] queries such as multiple and deeply nested 
predicates with mixed value-based and structure-based query conditions, and 
aggregations; and it extends an XPath query to select an arbitrary set of distinct 
elements with a single query. We also give an easy mapping scheme to make the 
verbose XPath queries more readable. In addition, we devise a simple algorithm 
to evaluate the XPath [15] queries in polynomial time in the average-case. 

Finally, we evaluate the performance of XQzip on a wide variety of benchmark 
XML data sources and compare the results with XMill, gzip and XGrind for 
compression and query performance. Our results show that the compression ratio 
of XQzip is comparable to that of XMill and approximately 16.7% better than 
that of XGrind. XQzip's compression and decompression speeds are comparable 
to those of XMill and gzip, but several times faster than those of XGrind. In query 
evaluation, we record competitive figures. On average, XQzip evaluates queries 
12.84 times faster than XGrind with an initially empty buffer pool, and 80 times 
faster than XGrind with a warm buffer pool. In addition, XQzip supports efficient 
processing of many complex queries not supported by XGrind. Although we are 
not able to compare XPRESS directly due to the unavailability of the code, we 
believe that both our compression and query performance are better than that of 
XPRESS, since XPRESS only achieves a compression ratio comparable to that 
of XGrind and a query time 2.83 times better than that of XGrind, according 
to XPRESS’s experimental evaluation results [10]. 

Related Work. We are also aware of another XML compressor, XQueC [2], 
which also supports querying. XQueC compresses each data item individually 
and this usually results in a degradation in the compression ratio (compared to 
XMill). An important feature of XQueC is that it supports efficient evaluation 
of XQuery [16] by using a variety of structure information, such as dataguides 
[5], structure tree and other indexes. However, these structures, together with 
the pointers pointing to the individually compressed data items, would incur 
huge space overhead. Another queriable compression has also been proposed recently in 
[3], which compresses the structure tree of an XML document to allow it to be 
placed in memory to support Core XPath [6] queries. This use of the compressed 
structure is similar to the use of the SIT in XQzip, i.e. [3] condenses the tree 
edges while the SIT indexes the tree nodes. [3] does not compress the textual 
XML data items and hence it cannot serve as a direct comparison. 

This paper is organized as follows. We outline the XQzip architecture in 
Section 2. Section 3 presents the SIT and its construction algorithm. Section 4 
describes a queriable, compressed data storage model. Section 5 discusses query 
coverage and query evaluation. We evaluate the performance of XQzip in Section 
6 and give our concluding remarks and discuss our future work in Section 7. 



2 The Architecture of XQzip 

The architecture of XQzip consists of four main modules: the Compressor, the 
Index Constructor, the Query Processor, and the Repository. A simplified 
diagram of the architecture is shown in Fig. 4. We describe the operations related 
to the processes of compression and querying. 

For the compression process, the input XML document is parsed by the SAX 
Parser which distributes the XML data items (element contents and attribute 
values) to the Compressor and the XML structure (tags and attributes) to the 
Index Constructor. The Compressor compresses the data into blocks which can 
be efficiently accessed from the Hashtable where the element/attribute names 
are stored. The Index Constructor builds the SIT for the XML structure. 

For the querying process, the Query Parser parses an input query and then 
the Query Executor uses the index to evaluate the query. The Executor checks 
with the Buffer Manager, which applies the LRU rule to manage the Buffer Pool 
for the decompressed data blocks. If the data is already in the Buffer Pool, the 
Executor retrieves it directly without decompression. Otherwise, the Executor 
communicates with the Hashtable to retrieve the data from the compressed file. 



[Figure: the XQzip Repository, with Query input and Result output]

Fig. 4. Architecture of XQzip

3 XML Structure Index Trees (SITs)

In this section we introduce an effective indexing structure called a Structure 
Index Tree (or a SIT) for XML data. We first define a few basic terminologies 
used to describe the SIT and then present an algorithm to generate the SIT. 



3.1 Basic Notions of XML Structures 

We model the structure of an XML document as a tree, which we call the struc- 
ture tree. The structure tree contains only a root node and element nodes. The 
element nodes represent both elements and attributes. We add the prefix “@” to
the attribute names to distinguish them from the elements. We assign a Hash ID
to each distinct tag/attribute name and store it in a hashtable, i.e. the Hashtable
in Fig. 4. The XML data items are separated from the structure and are
compressed into different blocks accessible via the Hashtable. Hence, no text nodes
are considered in our model. For simplicity, we do not model namespaces, PIs and
comments, though it is a straightforward extension to include them.

Formally, the structure tree of an XML document is an unranked, ordered
tree, $T = (V_T, E_T, ROOT)$, where $V_T$ and $E_T$ are the sets of tree nodes and
edges respectively, and $ROOT$ is the unique root of $T$. We define a tree node
$v \in V_T$ by $v = (eid, nid, ext)$, where $v.eid$ is the Hash ID of the
element/attribute being modelled by $v$; $v.nid$ is the unique node identifier
assigned to $v$ according to document order; and initially $v.ext = \{v.nid\}$.
We represent each node $v$ by the pair $(v.eid, v.nid)$. The pair
$(ROOT.eid, ROOT.nid)$ is uniquely assigned as $(0,0)$. In addition, if a node
$v$ has $n$ (ordered) children $(\beta_1, \ldots, \beta_n)$, their order in $T$
is specified as: $v.\beta_1.eid \le v.\beta_2.eid \le \cdots \le v.\beta_n.eid$;
and if $v.\beta_i.eid = v.\beta_{i+1}.eid$, then
$v.\beta_i.nid < v.\beta_{i+1}.nid$. This node ordering accelerates node matching
in $T$ by an approximate factor of 2, since we match two nodes by their eids and,
on average, we only need to search half of the children of a given node.
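The node model and the child ordering can be illustrated with a minimal Python sketch. The class name `StructNode` and its methods are hypothetical and not part of XQzip; the sketch also locates children by binary search on the sorted eids, which is an assumption that goes slightly beyond the early-terminating scan the text implies.

```python
import bisect
from dataclasses import dataclass, field


@dataclass
class StructNode:
    """Hypothetical structure-tree node (eid, nid, ext)."""
    eid: int                                       # Hash ID of the tag/attribute name
    nid: int                                       # node identifier in document order
    ext: set = field(default_factory=set)          # node identifiers represented by this node
    children: list = field(default_factory=list)   # kept sorted by (eid, nid)

    def __post_init__(self):
        if not self.ext:
            self.ext = {self.nid}                  # initially v.ext = {v.nid}

    def add_child(self, child):
        """Insert child so that siblings stay ordered by eid, then by nid."""
        keys = [(c.eid, c.nid) for c in self.children]
        pos = bisect.bisect_left(keys, (child.eid, child.nid))
        self.children.insert(pos, child)

    def children_with_eid(self, eid):
        """Return the contiguous run of children whose eid matches."""
        keys = [c.eid for c in self.children]
        lo = bisect.bisect_left(keys, eid)
        hi = bisect.bisect_right(keys, eid)
        return self.children[lo:hi]


root = StructNode(0, 0)                            # (ROOT.eid, ROOT.nid) = (0, 0)
for eid, nid in [(9, 1), (3, 2), (17, 3), (3, 4)]:
    root.add_child(StructNode(eid, nid))
order = [(c.eid, c.nid) for c in root.children]
```

Because siblings with the same eid are adjacent, a node test on an eid touches only one contiguous run of children.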

Definition 1. (Branch and Branch Ordering) A branch of $T$, denoted as
$b$, is defined by $b = v_0 \to \cdots \to v_i \to \cdots \to v_p$, where $v_p$ is a
leaf node in $T$ and $v_{i-1}$ is the parent of $v_i$ for $0 < i \le p$. Let $B$ be a
set of branches of a tree or a subtree. A branch ordering $\prec$ on $B$ is defined
as: $\forall b_1, b_2 \in B$, let $b_1 = u_0 \to \cdots \to u_p$ and
$b_2 = v_0 \to \cdots \to v_q$; $b_1 \prec b_2$ implies that there exists some $i$
such that $u_i.nid = v_i.nid$ and $u_{i+1}.nid \ne v_{i+1}.nid$, and either
(1) $u_{i+1}.eid < v_{i+1}.eid$, or (2) $u_{i+1}.eid = v_{i+1}.eid$ and
$u_{i+1}.nid < v_{i+1}.nid$.

For example, given $b_1 = (0,0) \to \cdots \to (3,4)$, $b_2 = (0,0) \to \cdots \to (9,11)$
and $b_3 = (0,0) \to \cdots \to (3,20)$ in Fig. 2, we have $b_1 \prec b_2$,
$b_2 \prec b_3$ and $b_1 \prec b_3$. We can describe a tree as the sequence of all
its branches ordered by $\prec$. For example, the subtree rooted at the node
(17,27) in Fig. 2 can be represented as:
$(17,27) \to \cdots \to (3,28) \prec (17,27) \to \cdots \to (33,29) \prec (17,27) \to \cdots \to (70,31)$,
while the tree in Fig. 3 is represented as:
$x(3,4) \prec x(9,8) \prec x(89,7) \prec x(33,5) \prec x(70,13) \prec x(3,15) \prec x(33,16) \prec x(70,18)$,
where $x$ denotes $(0,0) \to \cdots \to$ for simplicity.

Definition 2. (SIT-Equivalence) Two branches, $b_1 = u_0 \to \cdots \to u_p$ and
$b_2 = v_0 \to \cdots \to v_q$, are SIT-equivalent if $u_i.eid = v_i.eid$ for
$0 \le i \le p$ and $p = q$. Two subtrees, $t_1 = b_{10} \prec \cdots \prec b_{1m}$ and
$t_2 = b_{20} \prec \cdots \prec b_{2n}$, are SIT-equivalent if $t_1.ROOT$ and
$t_2.ROOT$ are siblings and $b_{1i}$ and $b_{2i}$ are SIT-equivalent for
$0 \le i \le m$ and $m = n$.

For example, in Fig. 2, the subtrees rooted at the nodes (17,14) and (17,27) 
are SIT-equivalent subtrees since every pair of corresponding branches in the two 
subtrees are SIT-equivalent. The SIT-equivalent subtrees are duplicate struc- 
tures in XML data and thus we eliminate this redundancy by using a merge 
operator defined as follows. 

Definition 3. (Merge Operator) A merge operator, $Merge_T$, is defined as:
$Merge_T : (t_1, t_2) \to t$, where $t_1$ and $t_2$ are SIT-equivalent and
$t_1.ROOT.nid < t_2.ROOT.nid$, $t_1 = b_{10} \prec \cdots \prec b_{1n}$ and
$t_2 = b_{20} \prec \cdots \prec b_{2n}$, and $b_{1i} = u_0 \to \cdots \to u_p$ and
$b_{2i} = v_0 \to \cdots \to v_p$. For $0 \le i \le n$, $Merge_T$ assigns
$u_j.ext = u_j.ext \cup v_j.ext$ for $0 \le j \le p$, and then deletes $b_{2i}$.

Thus, the merge operator merges $t_1$ and $t_2$ to produce $t$, where $t$ is SIT-
equivalent to both $t_1$ and $t_2$. The effect of the merge operation is that the
duplicate SIT-equivalent structure is eliminated. We can remove this redundancy
in the structure tree to obtain a much more concise structure representation, the
Structure Index Tree (SIT), by applying $Merge_T$ iteratively on the structure tree
until no two SIT-equivalent subtrees are left. For example, the tree in Fig. 3 is
the SIT for the structure tree in Fig. 2. Note that all SIT-equivalent subtrees in
Fig. 2 are merged into a corresponding SIT-equivalent subtree in the SIT.

A structure tree and its SIT are equivalent, since the structures of the deleted 
SIT-equivalent subtrees are retained in the SIT. In addition, the deleted nodes 
are represented by their node identifiers kept in the node exts while the deleted 
edges can be reconstructed by following the node ordering. Since the SIT is 
in general much smaller than its structure tree, it allows more efficient node 
selection than its structure tree. 

3.2 SIT Construction 

In this section, we present an efficient algorithm to construct the SIT for an XML
document. We define four node pointers, parent, previousSibling, nextSibling,
and firstChild, for each tree node. The pointers tremendously speed up node
navigation for both SIT construction and query evaluation. The space incurred
for these pointers is usually insignificant since a SIT is often very small.

We linear-scan (by SAX) an input XML document only once to build its
SIT and meanwhile we compress the text data (detailed in Section 4). For
every SAX start/end-tag event (i.e. the structure information) parsed, we invoke
the procedure construct_SIT, shown in Fig. 5. The main idea is to operate on a
“base” tree and a constructing tree. A constructing tree is the tree under
construction for each start-tag parsed and it is a subtree of the “base” tree. When
an end-tag is parsed, a constructing tree is completed. If this completed subtree
is SIT-equivalent to any subtree in the “base” tree, it is merged into its SIT-
equivalent subtree; otherwise, it becomes part of the “base” tree. We use a stack
to indicate the parent-child or sibling-sibling relationships between the previous
and the current XML element to build the tree structure. Lines 11-20 maintain
the consistency of the structure information and skip redundant information.
Hence, the stack size is always less than twice the height of the SIT.

procedure construct_SIT (SAX-Event)
/* stack is an array keeping the start/end tag information (either START-TAG or END-TAG);
   top indicates the stack top; c is the current node pointer; count initially is set to 0 */
begin
1.  if (SAX-Event is a start-tag event)  /* an attribute is also a start-tag event */
2.    create a new node, u, where u.eid := hash(SAX-Event) and count := count + 1, u.nid := count;
3.    if (stack[top] = START-TAG)
4.      assign u as the firstChild of c;
5.    else
6.      insert u among the siblings of c according to the SIT node ordering;
7.    top := top + 1; stack[top] := START-TAG;
8.  else if (SAX-Event is an end-tag event)  /* an end-tag event is also passed after processing an attribute value */
9.    if (subtree(c) is SIT-equivalent to subtree(one of c’s preceding siblings, u))  /* check by a parallel DFS */
10.     Merge_T(subtree(u), subtree(c));
11.   if (stack[top] = START-TAG)  /* c has no child and the START-TAG was pushed for c */
12.     if (stack[top - 1] = START-TAG)  /* c is the first child of its parent */
13.       stack[top] := END-TAG;  /* finish processing c */
14.     else  /* c has preceding sibling(s) (processed) */
15.       top := top - 1;  /* use the previous end-tag to indicate c has been processed */
16.   else  /* the END-TAG indicates c’s child processed, stack[top - 1] must be start-tag indicating c not processed */
17.     if (stack[top - 2] = START-TAG)  /* c is the first child of its parent */
18.       top := top - 1; stack[top] := END-TAG;  /* remove c’s child’s stack and indicates c has been processed */
19.     else  /* c’s preceding sibling(s) processed */
20.       top := top - 2;  /* use c’s preceding sibling’s end-tag, i.e. stack[top - 2], to indicate c has been processed */
21. c := u;
end

Fig. 5. Pseudocode for the SIT Construction Procedure
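The one-pass construction can be approximated in Python with the standard `xml.sax` module. This is a simplified sketch, not the procedure of Fig. 5: it tracks the open element through parent pointers instead of the START-TAG/END-TAG stack, keeps `ext` as a list, and checks a completed subtree against all of its preceding siblings. The names `Node`, `SITBuilder`, `sit_equivalent` and `merge` are hypothetical.

```python
import xml.sax


class Node:
    def __init__(self, eid, nid, parent=None):
        self.eid, self.nid, self.parent = eid, nid, parent
        self.ext = [nid]          # node identifiers represented by this node
        self.children = []        # kept sorted by (eid, nid)


def sit_equivalent(a, b):
    """Parallel DFS: same eid at every corresponding position."""
    if a.eid != b.eid or len(a.children) != len(b.children):
        return False
    return all(sit_equivalent(x, y) for x, y in zip(a.children, b.children))


def merge(t1, t2):
    """Fold t2 into the SIT-equivalent t1 by extending the exts node by node."""
    t1.ext.extend(t2.ext)
    for x, y in zip(t1.children, t2.children):
        merge(x, y)


class SITBuilder(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.hash_ids = {}        # tag/attribute name -> Hash ID
        self.count = 0            # last assigned node identifier
        self.root = Node(0, 0)
        self.current = self.root

    def _open(self, name):
        eid = self.hash_ids.setdefault(name, len(self.hash_ids) + 1)
        self.count += 1
        node = Node(eid, self.count, self.current)
        self.current.children.append(node)
        self.current = node

    def _close(self):
        done, parent = self.current, self.current.parent
        parent.children.remove(done)              # the constructing tree is complete
        for sibling in parent.children:           # preceding siblings only
            if sit_equivalent(sibling, done):
                merge(sibling, done)              # duplicate structure is dropped
                break
        else:
            parent.children.append(done)          # becomes part of the "base" tree
            parent.children.sort(key=lambda n: (n.eid, n.nid))
        self.current = parent

    def startElement(self, name, attrs):
        self._open(name)
        for attr in attrs.getNames():             # an attribute is opened and closed at once
            self._open("@" + attr)
            self._close()

    def endElement(self, name):
        self._close()


def count_nodes(node):
    return 1 + sum(count_nodes(c) for c in node.children)


doc = b"<lib><book id='1'><t/></book><book id='2'><t/></book><cd><t/></cd></lib>"
builder = SITBuilder()
xml.sax.parseString(doc, builder)
sit_size = count_nodes(builder.root)
```

For this small document the structure tree has ten nodes including the root, while the resulting tree has seven, because the second `book` subtree is folded into the first.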

The time complexity is $O(|V_T|)$ in the average case and $O(|SIT||V_T|)$ in
the worst case, where $|V_T|$ is the number of tags and attributes in the XML
document and $|SIT|$ is the number of nodes in the SIT. $O(|SIT||V_T|)$ is the
worst-case complexity because we at most compare and merge $2|SIT|$ nodes for
each of the $|V_T|$ nodes parsed. However, in most cases only a constant number
of nodes are operated on for each new element parsed, resulting in the $O(|V_T|)$
time. The space required is $|V_T|$ for the node exts and at most $2|SIT|$ for the
structure since, at all times, both the “base” tree and the constructing tree can
be at most as large as the final tree (i.e. the SIT).

SIT and F&B-Index. The SIT shares some similar features with the F&B-
Index [1,7]. The F&B-Index uses bisimulation [7,12] to partition the data nodes
while we use SIT-equivalence to index the structure tree. However, the SIT
preserves the node ordering whereas bisimulation preserves no order of the nodes.
This node ordering reduces the number of nodes to be matched in query
evaluation and in SIT construction by about 50% on average. The F&B-Index can
be computed in time $O(m \log n)$, where $m$ and $n$ are the number of edges and
nodes in the original XML data graph, by first adding an inverse edge for every
edge and then computing the 1-Index [9] using an algorithm proposed in [11].
However, the memory consumption is too high, since the entire structure of an
XML document must first be read into the main memory.



4 A Queriable Storage Model for Compressed XML Data 

In this section, we discuss a storage model for the compressed XML data. We 
seek to balance the full-chunked and the fine-grained storage models so that the 
compression algorithm is able to exploit the commonalities in the XML data to 
improve compression (i.e. the full-chunk approach), while allowing efficient re- 
trieval of the compressed data for query evaluation (i.e. the fine-grain approach). 

We group XML data items associated with the same tag/attribute name
into the same data stream (a technique also used in XMill [8]). Each data
stream is then compressed separately into a sequence of blocks. These compressed 
blocks can be decompressed individually and hence full decompression is avoided 
in query evaluation. The problem is that if a block is small, it does not make 
good use of data commonalities for a better compression; on the other hand, it 
will be costly to decompress a block if its size is large. Therefore, it is critical 
to choose a suitable block size in order to attain both a good compression ratio 
and efficient retrieval of matching data in the compressed file. 

We conduct an experiment (described in Section 6.1) and find that a block 
size of 1000 data records is feasible for both compression and query evaluation. 
Hence we use it as the default block size for XQzip. In addition, we set a limit of 
2 MBytes to prevent memory exhaustion, since some data records may be long. 
When either 1000 data records have been parsed into a data stream or the size 
of a data stream reaches 2 MBytes, we compress the stream using gzip, assign 
an id to the compressed block and store it on disk, and then resume the process. 
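The block-cutting rule can be sketched in Python with the standard `gzip` module. The class `DataStream`, the NUL separator between records, and keeping the blocks in memory rather than on disk are assumptions made for illustration; the block id follows the convention described in this section (the largest node identifier in the block).

```python
import gzip

MAX_RECORDS = 1000               # default block size in data records
MAX_BYTES = 2 * 1024 * 1024      # 2 MBytes guard against very long records


class DataStream:
    """Hypothetical stream for the data items of one tag/attribute name."""

    def __init__(self):
        self.records = []        # pending (nid, bytes) pairs
        self.pending_bytes = 0
        self.blocks = []         # (block_id, compressed bytes), in document order

    def add(self, nid, text):
        data = text.encode("utf-8")
        self.records.append((nid, data))
        self.pending_bytes += len(data)
        if len(self.records) >= MAX_RECORDS or self.pending_bytes >= MAX_BYTES:
            self.flush()

    def flush(self):
        """Compress the pending records into one block and start a new one."""
        if not self.records:
            return
        block_id = self.records[-1][0]            # largest node identifier in the block
        payload = b"\x00".join(data for _, data in self.records)
        self.blocks.append((block_id, gzip.compress(payload)))
        self.records, self.pending_bytes = [], 0


stream = DataStream()
for nid in range(1, 2501):
    stream.add(nid, f"item {nid}")
stream.flush()                                    # write out the final partial block
block_ids = [block_id for block_id, _ in stream.blocks]
```

Each block can be decompressed on its own, so a query that needs one record pays for at most one block of its stream.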

The start position of a block in the compressed file is stored in the Element
Hashtable. (Note that gzip can decompress a block given its start position and
an arbitrary data length.) We also assign an id to each block as the value of
the maximum node identifier of the nodes whose data is compressed into that
block. To retrieve the block which contains the compressed data of a node, we
obtain the block position by using the containment relationship of the node’s
node identifier and the ids of the successive compressed blocks of the node’s data
stream. The position of the node’s data is kept in an array and can be obtained
by a binary search on the node identifier (in our case, this takes only $\log 1000$
steps since each block has at most 1000 records) and the data length is simply
the difference between two successive positions.
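The two lookups can be illustrated with Python's `bisect` module. This is a sketch under assumptions: the function names are hypothetical, the block ids of a stream are held in an ascending list, and the position array carries one extra end offset so that the last record's length can also be taken as a difference.

```python
import bisect


def find_block(block_ids, nid):
    """Index of the block whose id range contains nid.

    block_ids holds, in ascending order, the largest node identifier of each block.
    """
    i = bisect.bisect_left(block_ids, nid)
    if i == len(block_ids):
        raise KeyError(nid)
    return i


def find_record(nids, positions, data, nid):
    """Return the data item of node nid inside one decompressed block."""
    k = bisect.bisect_left(nids, nid)             # binary search on the node identifier
    if k == len(nids) or nids[k] != nid:
        raise KeyError(nid)
    return data[positions[k]:positions[k + 1]]    # length = difference of positions


block_ids = [40, 97, 130]        # three blocks of one data stream
nids = [41, 55, 97]              # nodes stored in the second block
data = b"redgreenblue"           # that block after decompression
positions = [0, 3, 8, 12]        # start offsets plus a final end offset

block_index = find_block(block_ids, 55)
value = find_record(nids, positions, data, 55)
```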

A desirable feature of the queriable compressors XGrind [14] and XPRESS 
[10] is that decompression is avoided since string conditions can be encoded 
to match with the individually compressed data, while with our storage model 
(partial) decompression is always needed for the matching of string conditions. 
However, this is only true for exact-match and numeric range-match predicates;
decompression is still inevitable in XGrind and XPRESS for any other value-
based predicates such as string range-match, starts-with and substring matches.
To evaluate these predicates, our block model is much more efficient, since
decompressing $x$ blocks is far less costly than decompressing the corresponding
$1000x$ individually compressed data units. More importantly, as we will discuss
in Section 5.2, our block model allows the efficient management of a buffer pool
which significantly reduces the decompression overhead, while the compressed
blocks serve naturally as input buffers to facilitate better disk reads.

5 Querying Compressed XML Using SIT 

In this section, we present the queries supported by XQzip and show how they 
are evaluated on the compressed XML data. 

5.1 Query Coverage 

Our implementation of XQzip supports most of the core features of XPath 1.0
[15]. We extend XPath to select an arbitrary set of distinct elements by a single 
query and we also give a mapping to reduce the verbosity of the XPath syntax. 

XPath Queries. A query specifies the matching nodes by the location path. A 
location path consists of a sequence of one or more location steps, each of which 
has an axis, a node test and zero or more predicates. The axis specifies the 
relationship between the context node and nodes selected by the location step. 
XQzip supports eight XPath axes: ancestor, ancestor-or-self, attribute, child,
descendant, descendant-or-self, parent and self. XQzip simplifies the node test by
comparing just the eids of the nodes. The predicates use arbitrary expressions, 
which can in turn be a location path containing more predicates and so on 
recursively, to further refine the set of nodes selected by the location step. 

Apart from the comparison operators (=, !=, >, <, >= and <=) and
string operators (contains, i.e. substring, and starts-with), XQzip supports a
complete set of standard aggregation operators (count, sum, average, minimum
and maximum). XQzip also allows structure-based, value-based, and aggregation
predicates to be combined by the logical operators (not, or and and).

XPath Group Queries. An XPath query can only specify one distinct element 
to be selected at a time. We modify the XPath syntax slightly to make it possible 
to select an arbitrary set of distinct elements by a single query, which we call an 
XPath group query. We use “(” and “)” to indicate the grouping, and “+” to 
represent the union of elements in a group. For example, the XPath group query 
“(//Orderitem[discount[. >= 20% and . <= 50%]]/(@id + quantity + price))” 
selects three elements from “Orderitem” with a “discount” of 20-50%. 

Evaluating an XPath group query is much more efficient than evaluating a 
group of XPath queries, since all location paths inside a group share the same 
context node addressed by the location path just preceding the group. For
example, given $(l/(l_0 + \cdots + l_n))$, we evaluate $l$ only once for all $l_i$.

Abbreviated Syntax. Although the syntax of XPath is straightforward, it
is rather verbose as a query language. We map the XPath axes, together with
the functions and operators, to more concise syntactic abbreviations. We show
the mapping in Table 1 (examples of the mapping are given in [4] but omitted
here due to space limitations). We note that the abbreviations of the axes child,
attribute, self, descendant-or-self and parent are also given in [15], but we give
different abbreviations to the last two axes so as to give a complete but easy
mapping for the queries covered by XQzip. In order to make parsing easier, our
query parser also requires that predicates in queries be fully parenthesized.

Table 1. Abbreviated Syntax

Full Form            Abbr.
self                 …
child                /
parent               \
ancestor             …
ancestor-or-self     …
descendant           …
descendant-or-self   /.
root                 …
wildcard             *
logical-and          &
contains             ?=
starts-with          $=
…                    …

5.2 Query Evaluation 

XQzip evaluates queries in four major phases: (1) query parsing; (2) node selec- 
tion; (3) data retrieval; and (4) query result output. 

Query Parsing. The query parser translates an input query into a stream 
of events represented as integers, with positive values representing the XML 
elements (i.e. their Hash IDs) and negative values representing other expressions. 

Node Selection. Node selection is critical in query evaluation. A survey [6]
shows that contemporary XPath query engines evaluate XPath queries in
exponential time. The cause of the exponential time evaluation is that for each
location step, a set of nodes of size linear in the size of the document may be
selected and each node in this set may in turn select a linear number of nodes
for the next location step. Hence, the time complexity is $O(|D|^{|Q|})$, where
$|D|$ is the document size and $|Q|$ the query size. Although [6] proposes a
polynomial-time XPath evaluation algorithm, it is not applicable in our setting.
We propose a simple algorithm which gives polynomial time complexity in the
average case.

Our algorithm basically divides an axis closure into two disjoint areas. We
associate a visited flag with each node in the SIT. A subtree is visited if its root’s
visited flag is set. The union of all visited subtrees in an axis closure with respect
to a context node forms the visited_closure, and the unvisited_closure is simply
the difference between the axis closure and the visited_closure.

We give the core of our query evaluation algorithm in Fig. 6. The idea is as
follows: on evaluating $s_i \ldots s_n$ where $a_i$ is the descendant or descendant-or-self
axis (and similarly if $a_i$ is the ancestor or ancestor-or-self axis) w.r.t. a context
node $u$, the subtree rooted at $u$ is set to be visited when the evaluation process
finishes $s_i \ldots s_n$ w.r.t. $u$ (regardless of the evaluation result), since the result
of $s_i \ldots s_n$ will always be the same for the same context node $u$. Moreover, an
ancestor always includes its descendants and hence we set a node visited when
it is included in the result set. Consequently, as the evaluation process goes
on, more and more subtrees will be visited and the unvisited_closure becomes
smaller and even vanishes. This implies that the number of nodes selected at each
location step is no longer linear at later stages of the query evaluation. Hence, we
have the average-case polynomial query evaluation time in the size of the SIT.

The worst-case time complexity is still exponential, i.e. $O(|SIT|^{|Q|})$, since the
unvisited_closure has no effect on predicates. Nonetheless, predicate evaluation
rarely checks all nodes specified by the predicate’s location path but terminates
as soon as one evaluation returns true. More importantly, $|SIT|$ is often orders of
magnitude smaller than $|D|$, implying that $|SIT|^{|Q|}$ is much smaller than
$|D|^{|Q|}$. The space complexity is $O(|SIT| + |V_T|)$: $O(|SIT|)$ since we only
hold the SIT in memory and all nodes in the result set are distinct (we do not
count the space requirements for the buffer pool and for writing the query result)
and $O(|V_T|)$ space is needed to indicate which elements are matched for
value-based predicates.



procedure evaluate_query (u, i, Q)
/* u is the context node and i is initially set to 0; Q: s_0 ... s_i ... s_n, where s_i = <a_i, t_i, p_ij> */
begin
1.  for each node v in unvisited_closure(a_i(u)) do
2.    if (t_i(v) is true and for all j, p_ij is true for v)
3.      if (i < n)
4.        evaluate_query(v, i + 1, Q);
5.      else /* i = n */
6.        include v in the query result set and set v.visited_flag;
7.    if (a_i = \\)
8.      set v.visited_flag;
9.  if (a_i = //)
10.   set u.visited_flag;
end



Fig. 6. Core Query Evaluation Algorithm 
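The pruning effect of the visited flags can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not the algorithm of Fig. 6 in full: every location step uses the descendant axis, node tests compare tag strings instead of eids, predicates are omitted, and the names `Node`, `unvisited_descendants` and `evaluate_query` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str                                      # label used only to identify nodes here
    tag: str
    children: list = field(default_factory=list)
    visited: bool = False


def unvisited_descendants(u):
    """Lazily yield the descendants of u in document order, skipping visited subtrees."""
    stack = list(reversed(u.children))
    while stack:
        v = stack.pop()
        if v.visited:
            continue                               # the whole subtree is pruned
        yield v
        if not v.visited:                          # v may be flagged while the caller handles it
            stack.extend(reversed(v.children))


def evaluate_query(u, i, steps, result):
    """Evaluate steps[i:] with context node u; each step is a descendant-axis tag test."""
    for v in unvisited_descendants(u):
        if v.tag == steps[i]:
            if i < len(steps) - 1:
                evaluate_query(v, i + 1, steps, result)
            else:
                result.append(v)
                v.visited = True                   # a result node covers its own descendants
    u.visited = True                               # the remaining steps are finished for u


b1, b2, b3 = Node("b1", "b"), Node("b2", "b"), Node("b3", "b")
a2 = Node("a2", "a", [b2])
a1 = Node("a1", "a", [b1, a2])
a3 = Node("a3", "a", [b3])
root = Node("root", "root", [a1, a3])

result = []
evaluate_query(root, 0, ["a", "b"], result)        # roughly //a//b
names = [n.name for n in result]
```

Once the second step has been finished for `a1`, the nested `a2` is never used as a context node, so `b2` is reported once. The flags persist after the call, so they would have to be cleared before another query is evaluated.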



Data Retrieval and Decompression. We have described the retrieval of a
compressed block and the retrieval of data from a decompressed block in Section
4. Although data retrieval is not expensive, an element may appear in
many places in a query or in a set of queries asked consecutively, resulting in a
compressed block being retrieved and decompressed many times. Since we use
gzip as our underlying compression tool, we cannot do much to improve the time
to decompress a block. Instead, we avoid decompressing the same block
repeatedly by introducing a buffer pool.

XQzip applies the LRU rule to manage a buffer pool for holding recently 
decompressed XML data. The buffer pool is modelled as a doubly-linked list 
with a head and tail pointer and the buffers do not have a fixed size but are 
allocated dynamically according to decompressed data size. When a new block 
is decompressed, the buffer manager appends it to the tail of the list. When a 
block is accessed again, the buffer manager takes it out from the list and appends 
it to the tail. We set a memory limit (default 16 MBytes) to the total size of the 
buffer pool. When the memory limit is reached, the buffer manager removes the 
buffers at the head of the list until memory is sufficient to allocate a new buffer.
Each buffer in the pool can be instantly accessed from the Hashtable and
is assigned an id which is the same as the compressed block id, thus avoiding
decompressing a block again if a buffer with the same id is already in the pool.
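The LRU policy described above can be sketched in Python with `collections.OrderedDict`, which plays the role of the doubly-linked list (oldest buffer first). The class `BufferPool`, the `decompress` callback and the `misses` counter are assumptions introduced for illustration, as is the assumption that block ids are unique across the pool.

```python
from collections import OrderedDict


class BufferPool:
    """Hypothetical LRU pool of decompressed blocks with a total byte limit."""

    def __init__(self, limit_bytes=16 * 1024 * 1024):
        self.limit = limit_bytes
        self.used = 0
        self.buffers = OrderedDict()   # block id -> decompressed bytes, head = least recent
        self.misses = 0                # number of decompressions performed

    def get(self, block_id, decompress):
        if block_id in self.buffers:
            self.buffers.move_to_end(block_id)     # re-append to the tail on access
            return self.buffers[block_id]
        self.misses += 1
        data = decompress(block_id)
        # Evict from the head until the new buffer fits within the limit.
        while self.buffers and self.used + len(data) > self.limit:
            _, old = self.buffers.popitem(last=False)
            self.used -= len(old)
        self.buffers[block_id] = data
        self.used += len(data)
        return data


pool = BufferPool(limit_bytes=10)
fake_decompress = lambda block_id: bytes(4)        # stands in for gzip decompression
for block_id in (1, 2, 1, 3):
    pool.get(block_id, fake_decompress)
cached = list(pool.buffers)
```

In this run block 2 is evicted rather than block 1, because block 1 was accessed again before block 3 arrived.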

The data access patterns of queries asked at a certain time are usually similar 
according to the principle of locality. Therefore, after some queries have been 
evaluated and the buffers have been initialized, new blocks tend to be
decompressed only occasionally. Our experimental evaluation shows that the
buffer pool significantly reduces the querying time: the average querying time
measured with a warm (initialized) buffer pool is lower than that with a cold
buffer pool by a factor of 5.14. Moreover, restoring the original XML document
from the compressed file is also much faster with a warm buffer pool.

Query Result Output. The query processor produces the query result
specified by the output expression. XQzip allows the following output expressions: (1)
not specified: all elements in the result set are returned; (2) location path/text():
only text contents of the result elements are returned; (3) location path/op: one
of the five aggregation operations; and (4) [Q]: returns true if Q evaluates to
true, false otherwise.

6 Experimental Evaluation 

We evaluated the performance of XQzip by an extensive set of experiments. All 
experiments were run on a Windows XP machine with a Pentium 4, 2.4 GHz 
and 256 MBytes main memory. We compared our compression performance with 
XMill, gzip and XGrind, and query performance with XGrind. Since XGrind is 
not able to compress all the datasets used in our evaluation and simply outputs 
query results as “found” or “not found” , we modified the XGrind source code to 
make it work for all the datasets we used and write query results to a disk file, 
as XQzip does. We also made XGrind adapt to our experimental platform. 

We first studied the effect of using different sized data blocks on the compres- 
sion and query performance of XQzip; the aim of this experiment is to choose a 
feasible default block size for XQzip. We then performed, for each data source, 
four classes of experiments: (1) the effectiveness of the SIT; (2) compression 
ratios; (3) compression/decompression time; and (4) query performance. We 
define the compression ratio as: Compression Ratio = (1 - Compressed file
size / Original XML file size) * 100%, and we measure all times in seconds.
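As a quick illustration of this definition, the ratio can be computed as follows. The function name and the file sizes are hypothetical and are not measurements from the experiments.

```python
def compression_ratio(original_size, compressed_size):
    """Compression ratio in percent: the share of the original size that is saved."""
    return (1 - compressed_size / original_size) * 100.0


# Hypothetical sizes: a 111 MB document compressed to 27.75 MB.
ratio = compression_ratio(111.0, 27.75)
```

Under this definition a higher ratio means a smaller compressed file, and a ratio of 0% means no saving at all.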

We use eight data sources for our evaluation, which cover a wide range of
XML data formats and structures. Due to the space limit, the description of the
datasets is given in [4], but we give their characteristics in Table 2, where E_num
and A_num refer to the number of elements and attributes in the dataset respectively.

6.1 Effect of Using Different Block Size 

We carried out a set of experiments to explore the effects of using different data
block sizes on compression and query performance. We chose three representative
documents: SwissProt (which has no heavy text items), XMark (which has a lot
of data and one heavy text item) and OMIM (whose data content is dominated
by very heavy texts) for running the experiments.

Table 2. XML Data Sources

Data Source   Size (MB)   Depth   Tags/Attrs   E_num      A_num
XMark         111         11      86           1666315    381878
OMIM          24.5        5       22           188052     0
DBLP          148         6       41           3883112    471124
SwissProt     109         5       100          2977031    2189859
Treebank      82          36      252          2437666    1
PSD           683         7       72           21305818   1052770
Shakespeare   7.3         6       23           179072     0
Lineitem      30.8        3       19           1022976    1

Compression. For all datasets, compression performance is extremely poor for 
block sizes less than 2 KBytes and improves linearly with the increase in block 
size (greater than 2 KBytes), but does not improve much (within 10%) for block 
sizes beyond 100-150 (SwissProt: ~ 150, XMark: ~130, OMIM: ~100) KBytes. 




Fig. 7. Querying Time with Different Block Sizes 



Query Evaluation. We use range predicates to select a set of queries (the 
queries are listed in [4] due to the space limit) of different selectivity for each 
dataset: low-selectivity (appr. 0.01%, 0.03%, 0.05%, 0.08% and 0.1%), medium- 
selectivity (appr. 0.3%, 0.5%, 0.7%, 1% and 3%) and high-selectivity (appr. 5%, 
20%, 50%, 80% and 100%). For each dataset, we plot the average querying time 
of the queries of each selectivity group, represented by the prefixes L, M and H 
respectively in Fig. 7. We also found that the block size is actually sensitive to 
the number of records per block instead of number of bytes per block. We thus 
measure the block size in terms of number of data records per block. 

For all the three data sources, query performance is poor on small block 
sizes (less than 100 records). High-selectivity queries have better performance 
on larger block sizes though performance improves only slightly for block sizes 
beyond 1000 records. Medium and low selectivity queries have best performance 
in the range of 500 to 800 records and 250 to 300 records respectively, and their 
querying time increases linearly for block sizes exceeding the optimal ranges. The 
difference in querying time of the various selectivity queries with the change in
block size is mainly due to the inverse correlation between the decompression
time of the different-sized blocks and the total number of blocks to be decom- 
pressed w.r.t. a particular block size, i.e. larger blocks have longer decompres- 
sion time but fewer blocks need be decompressed, and vice versa. Although the 
optimal block size does not agree for the different data sources and different 
selectivity queries, we find that within the range of 600 to 1000 data records per 
block, the querying time of all queries is close to their optimal querying time. 
We also find that a block size of about 950 data records is the best average. 

For most XML documents, a total size of 950 records of a distinct element 
is usually less than 100 KBytes, a good block size for compression. However, to 
facilitate query evaluation, we choose a block size of 1000 data records per block 
(instead of 950 for easier implementation) as the default block size for XQzip, 
and we demonstrate that it is a feasible choice in the subsequent subsections. 



6.2 Effectiveness of the SIT 

In this subsection, we show that the SIT is an effective index. In Table 3, $|T|$
represents the total number of tags and attributes in each of the eight datasets,
while $|V_T|$ and $|V_I|$ show the number of nodes (presentation tags not indexed)
in the structure tree and in the SIT respectively; $|V_I|/|V_T|$ is the percentage of
node reduction of the index; Load Time (LT) is the time taken to load the SIT
from a disk file to the main memory; and Acceleration Factor (AF) is the rate
of acceleration in node selection using the SIT instead of the F&B-Index.



Table 3. Index Size

Data Source   |T|       |V_T|     |V_I|   |V_I|/|V_T|   LT   AF
XMark         2048193   1837608   …       1.64%         …    …
OMIM          188052    188052    …       0.24%         …    …
DBLP          4354236   …         …       0.04%         …    …
SwissProt     …         …         …       …             …    …
Treebank      2437667   2437667   …       93.42%        …    …
PSD           …         …         …       10.85%        …    …
Shakespeare   179072    179072    …       …             …    …
Lineitem      1022977   1022977   …       0.002%        …    …

For five out of the eight datasets, the size of the SIT is only an average of 0.7% 
of the size of their structure tree, which essentially means that the query search 
space is reduced approximately 140 times. For SwissProt and PSD, although the 
reduction is smaller, it is still a significant one. The SIT of Treebank is almost 
the same size as its structure tree, since Treebank is totally irregular and very 
nested. We remark that there are few XML data sources in real life as irregular as 
Treebank. Note also that most of the SITs only need a fraction of a second to be 
loaded in the main memory. We find that the load time is roughly proportional
to $|V_I|/|V_T|$ (i.e. irregularity) and $|V_T|$ of an XML dataset.

We built the F&B-Index (no idrefs, presentation tags and text nodes), using
a procedure described in [7]. However, it ran out of memory for the DBLP, SwissProt
and PSD datasets on our experimental platform. Therefore, we performed this
experiment on these three datasets on another platform with 1024 MBytes of 
memory (other settings being the same). On average, the construction (including 
parsing) of the SIT is 3.11 times faster than that of the F&B-Index. We next 
measured the time taken to select each distinct element in a dataset using the 
two indexes. The AF for each dataset was then calculated as the sum of time 
taken for all node selections of the dataset (e.g. 86 node selections for XMark 
since it has 86 distinct elements) using the F&B-Index divided by that using the 
SIT. On average, the AF is 2.02, which means that node selection using the SIT 
is faster than that using the F&B-Index by a factor of 2.02. 
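
The AF computation described above can be sketched as follows. This is a minimal illustration only: the function name and the timing values are hypothetical and are not taken from the experiments.

```python
def acceleration_factor(fb_times, sit_times):
    """Return total F&B-Index selection time divided by total SIT selection time.

    Both arguments map an element name to the measured selection time in seconds.
    """
    if set(fb_times) != set(sit_times):
        raise ValueError("both indexes must be timed on the same elements")
    return sum(fb_times.values()) / sum(sit_times.values())


# Hypothetical timings for three distinct elements (not taken from the paper).
fb_times = {"item": 0.040, "person": 0.030, "name": 0.010}
sit_times = {"item": 0.020, "person": 0.015, "name": 0.005}
af = acceleration_factor(fb_times, sit_times)
print(f"AF = {af:.2f}")
```

Because the AF is a ratio of sums rather than an average of per-element ratios, elements with long selection times dominate the result.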



[Figure: compression ratio per data source for XQzip+, XQzip, XMill, gzip and XGrind; x-axis: Data Sources] 

Fig. 8. Compression Ratio 



6.3 Compression Ratio 

Fig. 8 shows the compression ratios for the different datasets and compressors. 
Since XQzip also produces an index file (the SIT and data position information), 
we represent the sum of the size of the index file and that of the compressed file 
as XQzip+. On average, we record a compression ratio of 66.94% for XQzip+, 
81.23% for XQzip, 80.94% for XMill, 76.97% for gzip, and 57.39% for XGrind. 

When the index file is not included, XQzip achieves a slightly better compression 
ratio than XMill, since no structure information of the XML data is kept 
in XQzip’s compressed file. Even when the index file is included, XQzip still 
achieves a compression ratio 16.7% higher than that of XGrind, while the 
compression ratio of XPRESS is only on par with that of XGrind. 

6.4 Compression/Decompression Time 

Fig. 9a shows the compression time. Since XGrind’s time is much greater than 
that of the others, we represent the time in logarithmic scale for better viewing. 
The compression time for XQzip is split into three parts: (1) parsing the input 
XML document; (2) applying gzip to compress data; and (3) building the SIT. 
The compression time for XMill is split into two parts as stated in [8]: (1) parsing 
and (2) applying gzip to compress the data containers. There is no split for gzip 
and XGrind. On average, XQzip is about 5.33 times faster than XGrind, while 
it is about 1.58 times and 1.85 times slower than XMill and gzip respectively. 
We remark, however, that XQzip also produces the SIT, which contributes a large 
portion of its total compression time, especially for the less regular data sources 
such as Treebank. 

Fig. 9b shows the decompression time for the eight datasets. The decompression 
time here refers to the time taken to restore the original XML document. 
We also report XQzip’s decompression time with the time taken to load the SIT 
added, represented as XQzip+. On average, XQzip is about 3.4 times faster than 
XGrind, while it is about 1.43 times and 1.79 times slower than XMill and gzip 
respectively, when the index load time is not included. Even when the load time is 
included, XQzip’s total time is still 3 times shorter than that of XGrind. 



[Figure: (a) compression time, split into parse, compress and index parts; (b) decompression time] 

Fig. 9. (a) Compression Time (b) Decompression Time (Seconds in log10 scale) 



6.5 Query Performance 

We measured XQzip’s query performance for six data sources. For each of the 
data sources, we give five representative queries which are listed in [4] due to 
the space limit. For each dataset except Treebank, Q1 is a simple path query for 
which no decompression is needed during node selection. Q2 is similar to Q1 but 
with an exact-match predicate on the result nodes. Q3 is also similar to Q1 but 
it uses a range predicate. The predicates are not imposed on intermediate steps 
of the queries since XGrind cannot evaluate such queries. Q4 and Q5 consist of 
multiple and deeply nested predicates with mixed structure-based, value-based, 
and aggregation conditions. They are used to evaluate XQzip’s performance 
on complex queries. The five queries of Treebank are used to evaluate XQzip’s 
performance on extremely irregular and deeply nested XML data. 

We recorded the query performance results in Table 4. Column (1) records 
the sum of the time taken to parse the input query and to select the set of 
result nodes. In case decompression is needed, the time taken to retrieve and 
decompress the data is given in Column (2). Column (3) and Column (4) give the 
time taken to write the textual query results (decompression may be needed) and 
the index of the result nodes respectively. Column (5) is the total querying time, 
which is the sum of Column (1) to (4) (note that each query was evaluated with 
an initially empty buffer pool). Column (6) records the time taken to evaluate 
the same queries but with the buffer pool initialized by evaluating several queries 
containing some elements in the query under experiment prior to the evaluation 
of the query. Column (7) records the time taken by XGrind to evaluate the 
queries. Note that XGrind can only handle the first three queries of the first five 
datasets and does not give an index to the result nodes. Finally, we record the 
disk file size of the query results in Column (8) and (9). Note that for the queries 
whose output expression is an aggregation operator, the result is printed to the 
standard output (i.e. C++ stdout) directly and there is no disk write. 

Table 4. Query Evaluation Results 

Columns: (1) Node Selecting Time (sec); (2) Partial Decomp. Time (sec); 
(3) Result (text) Processing Time (sec); (4) Result (index) Processing Time (sec); 
(5) Querying Time (sec) (XQzip-); (6) Querying Time (sec) (XQzip+); 
(7) Querying Time (sec) (XGrind); (8) Query Result (text) (KBytes); 
(9) Query Result (index) (KBytes). "..." is printed as in the table; "?" marks a 
cell that is illegible in this copy. 

Data Source          Query  (1)    (2)    (3)    (4)    (5)    (6)    (7)     (8)       (9)
XMark (111MB)        Q1     0.001  ...    0.911  0.001  0.913  0.122  22.774  ?         40
                     Q2     0.001  0.920  0.012  0.001  0.934  0.295  23.067  ?         0.09
                     Q3     0.001  3.395  0.014  ?      3.411  0.349  35.012  1.74      0.22
                     Q4     0.003  ...    0.551  0.030  0.584  0.118  ...     14999     1256
                     Q5     0.831  4.534  0.010  0.001  5.376  1.544  ...     0.21      0.03
OMIM (24.5MB)        Q1     0.001  ...    0.030  0.001  0.032  0.005  ?       ?         23.6
                     Q2     0.001  0.021  0.011  0.001  0.034  0.014  ?       ?         ?
                     Q3     0.001  0.036  0.057  0.001  0.095  ?      ?       ?         ?
                     Q4     0.005  ...    ...    ...    0.005  ?      ?       ...       ?
                     Q5     0.012  0.020  0.580  0.001  0.613  ?      ?       1666      ?
DBLP (148MB)         Q1     0.001  ...    ?      0.010  ?      0.034  19.582  7219      621
                     Q2     0.001  0.330  0.013  0.001  0.345  0.029  26.108  59        6
                     Q3     0.033  0.391  8.997  0.120  9.541  1.543  50.344  22940     1853
                     Q4     0.001  ...    0.000  0.000  0.001  0.001  ...     No Match  No Match
                     Q5     0.087  1.122  0.260  0.012  1.481  0.642  ...     2312      205
Lineitem (30.8MB)    Q1     0.001  ...    0.041  0.001  ?      0.011  2.336   1176      175
                     Q2     0.001  0.031  0.011  0.001  ?      0.012  2.890   130       16
                     Q3     0.001  0.058  0.015  0.001  ?      0.014  3.210   393       54
                     Q4     0.001  ...    1.594  0.082  1.677  0.342  ...     31539     4024
                     Q5     0.002  0.030  ...    ...    ?      0.007  ...     ...       ...
Shakespeare (7.3MB)  Q1     0.001  ...    ?      0.001  0.037  0.014  1.311   865       89
                     Q2     0.001  0.034  0.002  0.001  0.038  0.016  1.620   0.05      0.001
                     Q3     0.001  0.032  0.005  0.001  0.039  0.016  2.312   48        2.3
                     Q4     0.005  ...    ...    ...    0.005  0.005  ...     ...       ...
                     Q5     0.007  0.032  ...    ...    0.039  0.014  ...     ...       ...
Treebank (82MB)      Q1     0.321  ...    3.304  0.120  3.745  0.674  ?       21278     5659
                     Q2     0.167  ...    0.010  0.001  0.178  0.177  ?       0.45      0.12
                     Q3     0.183  ...    1.012  0.064  1.259  0.453  ?       785       204
                     Q4     0.124  ...    6.123  0.282  6.529  1.003  ?       24111     6078
                     Q5     0.156  ...    6.004  0.274  6.434  0.985  ?       24111     6078

Column (1) accounts for the effectiveness of the SIT and the query evaluation 
algorithm, since it is the time taken for the query processor to process node 
selection on the SIT. Compared to Column (1), the decompression time shown 
in Column (2) and (3) is much longer. In fact, decompression would be much 
more expensive if the buffer pool were not used. Despite this, XQzip still achieves 
an average total querying time 12.84 times better than XGrind, while XPRESS 
is only 2.83 times better than XGrind. When the same queries are evaluated with 
a warm buffer pool, the total querying time, as shown in Column (6), is reduced 
5.14 times and is about 80.64 times shorter than XGrind’s querying time. 

7 Conclusions and Future Work 

We have described XQzip, which supports efficient querying of compressed XML 
data by utilizing an index (the SIT) on the XML structure. We have demonstrated 
with extensive experimental evidence that XQzip (1) achieves compression ratios 
and compression/decompression times comparable to those of 
XMill; (2) achieves extremely competitive query performance results on the 
compressed XML data; and (3) supports a much more expressive query language 
than its counterpart technologies such as XGrind and XPRESS. We notice that 
a lattice structure can be defined on the SIT and we are working to formulate a 
lattice whose elements can be applied to accelerate query evaluation. 

Acknowledgements. This work is supported in part by grants HKUST 
6185/02E and HKUST 6165/03E from the Research Grant Council of Hong 
Kong. 

Abstract. In this paper we present HOPI, a new connection index for 
XML documents based on the concept of the 2-hop cover of a directed 
graph introduced by Cohen et al. In contrast to most of the prior work 
on XML indexing we consider not only paths with child or parent rela- 
tionships between the nodes, but also provide space- and time-efficient 
reachability tests along the ancestor, descendant, and link axes to sup- 
port path expressions with wildcards in our XXL search engine. We im- 
prove the theoretical concept of a 2-hop cover by developing scalable 
methods for index creation on very large XML data collections with 
long paths and extensive cross-linkage. Our experiments show substan- 
tial savings in the query performance of the HOPI index over previously 
proposed index structures in combination with low space requirements. 



1 Introduction 

1.1 Motivation 

XML data on the Web, in large intranets, and on portals for federations 
of databases usually exhibits a fair amount of heterogeneity in terms of 
tag names and document structure even if all data under consideration is 
thematically coherent. For example, when you want to query a federation 
of bibliographic data collections such as DBLP, Citeseer, ACM Digital Library, 
etc., which are not a priori integrated, you have to cope with structural 
and annotation (i.e., tag name) diversity. A query looking for authors 
that are cited in books could be phrased in XPath-style notation as 
//book//citation//author but would not find any results that look like 
/monography/bibliography/reference/paper/writer. To address this issue 
we have developed the XXL query language and search engine [24] in which 
queries can include similarity conditions for tag names (and also element and 
attribute contents) and the result is a ranked list of approximate matches. In 
XXL the above query would look like //~book//~citation//~author where 
~ is the symbol for “semantic” similarity of tag names (evaluated in XXL based 
on quantitative forms of ontological relationships, see [23]). 

When application developers do not have complete knowledge of the underlying 
schemas, they often do not even know whether the required information can 
be found within a single document or needs to be composed from multiple, connected 
documents. Therefore, the paths that we consider in XXL for queries of 
the above kind are not restricted to a single document but can span different 
documents by following XLink [12] or XPointer kinds of links. For example, a 
path that starts as /monography/bibliography/reference/URL in one document 
and is continued as /paper/authors/person in another document would 
be included in the result list of the above query. Instead of following a URL-based 
link, an element of the first document could also point to non-root elements 
of the second document, and such cross-linkage may also arise within a single 
document. 

To efficiently evaluate path queries with wildcards (i.e., // conditions in 
XPath), one needs an appropriate index structure such as Data Guides [14] and 
its many variants (see related work in Section 2). However, prior work has mostly 
focused on constructing index structures for paths without wildcards, with poor 
performance for answering wildcard queries, and has not paid much attention to 
document-internal and cross-document links. The current paper addresses this 
problem and presents a new path index structure that can efficiently handle path 
expressions over arbitrary graphs (i.e., not just trees or nearly-tree-like DAGs) 
and supports the efficient evaluation of queries with path wildcards. 



1.2 Framework 

We consider a graph $G_d = (V_d, E_d)$ for each XML document $d$ that we know 
about (e.g., that the XXL crawler has seen when traversing an intranet or some 
set of Web sites), where 1) the vertex set $V_d$ consists of all elements of $d$ plus 
all elements of other documents that are referenced within $d$ and 2) the edge set 
$E_d$ includes all parent-child relationships between elements as well as links from 
elements in $d$ to external elements. 

Then, a collection of XML documents $X = \{d_1, \ldots, d_n\}$ is represented by 
the union $G_X = (V_X, E_X)$ of the graphs $G_{d_1}, \ldots, G_{d_n}$ where $V_X$ is the union of 
the $V_{d_i}$ and $E_X$ is the union of the $E_{d_i}$. We represent both document-internal 
and cross-document links by an edge between the corresponding elements. Let 
$L_X = \{(v,w) \in E_X \mid v \in V_{d_i}, w \in V_{d_j}, i \neq j\}$ be the set of links that span 
different documents. 

In addition to this element-granularity global graph, we maintain the document 
graph $DG_X = (DV_X, DE_X)$ with $DV_X = \{d_1, \ldots, d_n\}$ and 
$DE_X = \{(d_i, d_j) \mid \exists u \in d_i, v \in d_j \text{ s.t. } (u,v) \in L_X\}$. Both the vertices and the edges 
of the document graph are augmented with weights: the vertex weight $vw_i$ for 
the vertex $d_i$ is the number of elements that document $d_i$ contains, and the edge 
weight $ew_{ij}$ for the edge between $d_i$ and $d_j$ is the total number of links that exist 
from elements of $d_i$ to elements of $d_j$. 
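
As an illustration of this framework, the following sketch derives the cross-document link set and the weighted document graph from an element-level edge list. It is a minimal in-memory sketch; the function name, the `doc_of` mapping and the sample ids are assumptions made for the example and do not come from the XXL implementation.

```python
from collections import defaultdict


def build_document_graph(doc_of, edges):
    """Derive the weighted document graph from element-level edges.

    doc_of: maps each element id to the id of the document that contains it.
    edges:  iterable of (u, v) element pairs (parent-child edges and links).
    Returns (vertex_weight, edge_weight, links).
    """
    # Vertex weight: number of elements contained in each document.
    vertex_weight = defaultdict(int)
    for doc in doc_of.values():
        vertex_weight[doc] += 1

    # Links that span different documents.
    links = [(u, v) for (u, v) in edges if doc_of[u] != doc_of[v]]

    # Edge weight: number of links from elements of one document to another.
    edge_weight = defaultdict(int)
    for u, v in links:
        edge_weight[(doc_of[u], doc_of[v])] += 1

    return dict(vertex_weight), dict(edge_weight), links


# Hypothetical collection: elements 1-3 belong to d1, elements 4-5 to d2.
doc_of = {1: "d1", 2: "d1", 3: "d1", 4: "d2", 5: "d2"}
edges = [(1, 2), (2, 3), (4, 5), (3, 4), (2, 5), (5, 1)]
vertex_weight, edge_weight, links = build_document_graph(doc_of, edges)
```

In this example two links lead from d1 to d2 and one leads back, so the document graph has two weighted edges.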

Note that this framework disregards the ordering of an element’s children 
and the possible ordering of multiple links that originate from the same element. 
The rationale for this abstraction is that we primarily address schema-less 
or highly heterogeneous collections of XML documents (with old-fashioned and 
XML-wrapped HTML documents and href links being a special case, still interesting 
for Web information retrieval). In such a context, it is extremely unlikely 
that application programmers request access to the second author of the fifth 
reference and the like, simply because they do not have enough information 
about how to interpret the ordering of elements. 

1.3 Contribution of the Paper 

This paper presents a new index structure for path expressions with wildcards 
over arbitrary graphs. Given a path expression of the form $//A_1//A_2//\ldots//A_m$, 
the index can deliver all sequences of element ids $(e_1, \ldots, e_m)$ such that element 
$e_i$ has tag name $A_i$ (or, with the similarity conditions of XXL, a tag name 
$A'$ that is “semantically” close to $A_i$). As the XXL query processor gradually 
binds element ids to query variables after evaluating subqueries, an important 
variation is that the index retrieves all sequences $(x, e_2, \ldots, e_m)$ or $(e_1, \ldots, y)$ 
that satisfy the tag-name condition and start or end with a given element with 
id $x$ or $y$, respectively. Obviously, these kinds of reachability conditions could 
be evaluated by materializing the transitive closure of the element graph $G_X$. 
The concept of a 2-hop cover, introduced by Edith Cohen et al. in [9], offers a 
much better alternative that is an order of magnitude more space-efficient and 
has similarly good time efficiency for lookups, by encoding the transitive closure 
in a clever way. The key idea is to store for each node $n$ a subset of the node’s 
ancestors (nodes with a path to $n$) and descendants (nodes with a path from 
$n$). Then, there is a path from node $x$ to $y$ if and only if there is a middle-man $z$ 
that lies in the descendant set of $x$ and in the ancestor set of $y$. Obviously, the 
subset of descendants and ancestors that are explicitly stored should be as small 
as possible, and unfortunately, the problem of choosing them is NP-hard. 

Cohen et al. have studied the concept of 2-hop covers from a mostly theoretical 
perspective and with application to all sorts of graphs in mind. Thus they 
disregarded several important implementation and scalability issues and did not 
consider XML-specific issues either. Specifically, their construction of the 2-hop 
cover assumes that the full transitive closure of the underlying graph has initially 
been materialized and can be accessed as if it were completely in memory. 
Likewise, the implementation of the 2-hop cover itself assumes standard main-memory 
data structures that do not gracefully degrade into disk-optimized data 
structures when indexes for very large XML collections do not entirely fit in 
memory. 

In this paper we introduce the HOPI index (2-HOP-cover-based Index) that 
builds on the excellent theoretical work of [9] but takes a systems-oriented 
perspective and successfully addresses the implementation and scalability issues that 
were disregarded by [9]. Our methods are particularly tailored to the properties 
of large XML data collections with long paths and extensive cross-linkage for 
which index build time is a critical issue. Specifically, we provide the following 
important improvements over the original 2-hop-cover work: 

— We provide a heuristic but highly scalable method for efficiently constructing 
a complete path index for large XML data collections, using a divide-and-conquer 
approach with limited memory. The 2-hop cover that we can 
compute this way is not necessarily optimal (as this would require solving 
an NP-hard problem) but our experimental studies show that it is usually 
near-optimal. 

— We have implemented the index in the XXL search engine. The index itself 
is stored in a relational database, which provides structured storage and 
standard B-trees as well as concurrency control and recovery to XXL, but 
XXL has full control over all access to index data. We show how the necessary 
computations for 2-hop-cover lookups and construction can be mapped to 
very efficient SQL statements. 

— We have carried out experiments with real XML data of substantial size, 
using data from DBLP [20], as well as experiments with synthetic data from 
the XMach benchmark [5]. The results indicate that the HOPI index is 
efficient, scalable to large amounts of data, and robust in terms of the 
quality of the underlying heuristics. 
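
To make the idea of mapping 2-hop-cover lookups to SQL concrete, the sketch below stores labels in two relational tables and answers connection tests with a join. It is an illustration under stated assumptions, not the schema or the statements used in XXL: the table names `lin` and `lout`, their columns, the choice of SQLite, and the decision to store each node in its own labels are all hypothetical.

```python
import sqlite3


def build_index(conn, nodes, l_in, l_out):
    """Store 2-hop labels in two tables (hypothetical schema).

    Every node is also stored in its own in- and out-label, so that a
    connection whose center is one of its endpoints is found by the same join.
    """
    conn.execute("CREATE TABLE lin (id INTEGER, inid INTEGER, PRIMARY KEY (id, inid))")
    conn.execute("CREATE TABLE lout (id INTEGER, outid INTEGER, PRIMARY KEY (id, outid))")
    for n in nodes:
        for x in l_in.get(n, set()) | {n}:
            conn.execute("INSERT INTO lin VALUES (?, ?)", (n, x))
        for y in l_out.get(n, set()) | {n}:
            conn.execute("INSERT INTO lout VALUES (?, ?)", (n, y))


def connected(conn, u, v):
    """True if some center node lies in both the out-label of u and the in-label of v."""
    row = conn.execute(
        "SELECT EXISTS (SELECT 1 FROM lout JOIN lin ON lout.outid = lin.inid "
        "WHERE lout.id = ? AND lin.id = ?)",
        (u, v),
    ).fetchone()
    return bool(row[0])


def descendants(conn, u):
    """All nodes reachable from u according to the stored labels."""
    rows = conn.execute(
        "SELECT DISTINCT lin.id FROM lout JOIN lin ON lout.outid = lin.inid "
        "WHERE lout.id = ? AND lin.id <> ?",
        (u, u),
    ).fetchall()
    return {r[0] for r in rows}


# Hypothetical graph 1 -> 2 -> 3, 3 -> 4, 3 -> 5 with a hand-made 2-hop cover.
conn = sqlite3.connect(":memory:")
build_index(conn, nodes=[1, 2, 3, 4, 5],
            l_in={4: {3}, 5: {3}}, l_out={1: {2, 3}, 2: {3}})
```

With composite primary keys on both tables, the database can answer the join through its ordinary B-tree indexes, which is the kind of standard machinery the text refers to.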



2 Related Work 

We start with a short classification of structure indexes for semistructured 
data by the navigational axes they support. A structure index supports all 
navigational XPath axes. A path index supports the navigational XPath axes 
(parent, child, descendants-or-self, ancestors-or-self, descendants, 
ancestors). A connection index supports the XPath axes that are used 
as wildcards in path expressions (ancestors-or-self, descendants-or-self, 
ancestors, descendants). 

All three index classes traditionally serve to support navigation within the 
internal element hierarchy of a document only, but they can be generalized to 
include also navigation along links both within and across documents. Our approach 
focuses on connection indexes to support queries with path wildcards on 
arbitrary graphs that capture element hierarchies and links. 

Structure Indexes. Grust et al. [16,15] present a database index structure 
designed to support the evaluation of XPath queries. They consider an XML 
document as a rooted tree and encode the tree nodes using a pre- and post-order 
numbering scheme. Zezula et al. [26,27] propose tree signatures for efficient 
tree navigation and twig pattern matching. Theoretical properties and limits of 
pre-/post-order and similar labeling schemes are discussed in [8,17]. All these 
approaches are inherently limited to trees only and cannot be extended to capture 
arbitrary link structures. 



Path Indexes. Recent work on path indexing is based on structural summaries 
of XML graphs. Some approaches represent all paths starting from document 
roots, e.g., Data Guide [14] and Index Fabric [10]. T-indexes [21] support a 
predefined subset of paths starting at the root. APEX [6] is constructed by utilizing 
data mining algorithms to summarize paths that appear frequently in the query 
workload. The Index Definition Scheme [19] is based on bisimilarity of nodes. 
Depending on the application, the index definition scheme can be used to define 
special indexes (e.g., 1-Index, A(k)-Index, D(k)-Index [22], F&B-Index) where 
k is the maximum length of the supported paths. Most of these approaches can 
handle arbitrary graphs or can be easily extended to this end. 



Connection Indexes. Labeling schemes for rooted trees that support ancestor 
queries have recently been developed in the following papers. Alstrup and Rauhe 
[2] enhance the pre-/postorder scheme using special techniques from tree clustering 
and alphabetic codes for efficient evaluation of ancestor queries. Kaplan 
et al. [8,17] describe a labeling scheme for XML trees that supports efficient 
evaluation of ancestor queries as well as efficient insertion of new nodes. In [1, 
18] they present a tree labeling scheme based on a two-level partition of the tree, 
computed by a recursive algorithm called the prune&contract algorithm. 

All these approaches are, so far, limited to trees. We are not aware of any index 
structure that supports the efficient evaluation of ancestor and descendant 
queries on arbitrary graphs. The one, but somewhat naive, exception is to precompute 
and store the transitive closure $C_X = (V_X, E_X^+)$ of the complete XML 
graph $G_X = (V_X, E_X)$. $C_X$ is a very time-efficient connection index, but is 
wasteful in terms of space. Therefore, its effectiveness with regard to memory 
usage tends to be poor (for large data that does not entirely fit into memory) 
which in turn may result in excessive disk I/O and poor response times. 

To compute the transitive closure, time $O(|V|^3)$ is needed using the Floyd-Warshall 
algorithm (see Section 26.2 of [11]). This can be lowered to $O(|V|^2 + |V| \cdot |E|)$ 
using Johnson’s algorithm (see Section 26.3 of [11]). Computing transitive 
closures for very large, disk-resident relations should, however, use 
disk-block-aware external storage algorithms. We have implemented the “semi-naive” 
method [3] that needs time $O(|E_X^+| \cdot |V|)$. 
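
For illustration, the semi-naive idea can be sketched in memory as follows: each round extends only the connections discovered in the previous round by one more edge, instead of re-joining the whole closure. This is a minimal sketch with a hypothetical function name; it does not reproduce the disk-based implementation referred to above.

```python
def transitive_closure_semi_naive(edges):
    """Return the set of all pairs (u, v) such that v is reachable from u.

    Only the pairs found in the previous round (delta) are extended in each
    round, which is the defining property of semi-naive evaluation.
    """
    successors = {}
    for u, v in edges:
        successors.setdefault(u, set()).add(v)

    closure = set(edges)
    delta = set(edges)
    while delta:
        new_pairs = set()
        for u, v in delta:
            # Extend the newly found connection u ~> v by one edge v -> w.
            for w in successors.get(v, ()):
                if (u, w) not in closure:
                    new_pairs.add((u, w))
        closure |= new_pairs
        delta = new_pairs
    return closure


# Hypothetical graph: 1 -> 2 -> 3, 3 -> 4, 3 -> 5.
closure = transitive_closure_semi_naive([(1, 2), (2, 3), (3, 4), (3, 5)])
```

The loop terminates because every round either adds at least one new pair to the finite closure or leaves `delta` empty.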

3 Review of the 2-Hop Cover 

3.1 Example and Definition 

A 2-hop cover of a graph is a compact representation of connections in the graph 
that has been developed by Cohen et al. [9]. Let 
$T = \{(u,v) \mid \text{there is a path from } u \text{ to } v \text{ in } G\}$ be the set of all connections 
in a directed graph $G = (V, E)$ (i.e., $T$ is the transitive closure of the binary 
relation given by $E$). For each connection $(u,v)$ in $G$ (i.e., $(u,v) \in T$) choose a 
node $w$ on a path from $u$ to $v$ as a center node and add $w$ to a set $L_{out}(u)$ of 
descendants of $u$ and to a set $L_{in}(v)$ of ancestors of $v$. Now we can test 
efficiently if two nodes $u$ and $v$ are connected by a path by checking if 
$L_{out}(u) \cap L_{in}(v) \neq \emptyset$. There is a path from $u$ to $v$ iff 
$L_{out}(u) \cap L_{in}(v) \neq \emptyset$; and this connection from $u$ to $v$ is given by a first hop 
from $u$ to some $w \in L_{out}(u) \cap L_{in}(v)$ and a second hop from $w$ to $v$, hence the 
name of the method. 

Fig. 1. Collection of XML Documents which include 2-hop labels for each node 



As an example consider the XML document collection in Figure 1 with information 
for the 2-hop cover added. There is a path from $u = (2, \text{bibliography})$ 
to $v = (6, \text{paper})$, and we can easily test this because the intersection 
$L_{out}(u) \cap L_{in}(v) = \{3\}$ is not empty. 
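
The intersection test can be sketched in a few lines. The graph and labels below are hypothetical and are not those of Figure 1. The sketch also assumes the convention that every node implicitly belongs to its own labels, so that a connection whose center is one of its endpoints is recognized as well.

```python
def reachable(u, v, l_in, l_out):
    """2-hop test: is there a center node in both the out-label of u and the in-label of v?

    Assumption: each node is implicitly a member of its own labels.
    """
    out_u = l_out.get(u, set()) | {u}
    in_v = l_in.get(v, set()) | {v}
    return bool(out_u & in_v)


# Hypothetical graph: 1 -> 2 -> 3, 3 -> 4, 3 -> 5, with node 3 as the main center.
l_out = {1: {2, 3}, 2: {3}}
l_in = {4: {3}, 5: {3}}

print(reachable(1, 5, l_in, l_out))  # connection via center 3
print(reachable(4, 5, l_in, l_out))  # siblings, no connection
```

Only five label entries are stored here, although the graph has nine connections in its transitive closure.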

Now we can give a formal definition for the 2-hop cover of a directed graph. 
Our terminology slightly differs from that used by Cohen et al. While their 
concepts are more general, we adapted the definitions to better fit our XML 
application, leaving out many general concepts that are not needed here. 

A 2-hop label of a node $v$ of a directed graph captures a set of ancestors and 
a set of descendants of $v$. These sets are usually far from exhaustive; so they do 
not need to capture all ancestors and descendants of a node. 

Definition 1 (2-Hop Label). Let $G = (V, E)$ be a directed graph. Each node 
$v \in V$ is assigned a 2-hop label $L(v) = (L_{in}(v), L_{out}(v))$ where 
$L_{in}(v), L_{out}(v) \subseteq V$ such that for each node $x \in L_{in}(v)$ there is a path 
$(x \ldots v)$ in $G$ and for each node $y \in L_{out}(v)$, there is a path $(v \ldots y)$ in $G$. □ 

The idea of building a connection index using 2-hop labels is based on the 
following property. 

Theorem 1. For a directed graph $G = (V, E)$ let $u, v \in V$ be two nodes with 
2-hop labels $L(u)$ and $L(v)$. If there is a node $w \in V$ such that 
$w \in L_{out}(u) \cap L_{in}(v)$ then there is a path from $u$ to $v$ in $G$. □ 

Proof. This is an obvious consequence of Definition 1. □ 

A 2-hop labeling of a directed graph $G$ assigns to each node of $G$ a 2-hop 
label as described in Definition 1. A 2-hop cover of a directed graph $G$ is a 2-hop 
labeling that covers all paths (i.e., all connections) of $G$. 

Definition 2 (2-Hop Cover). Let $G = (V, E)$ be a directed graph. A 2-hop 
cover is a 2-hop labeling of graph $G$ such that if there is a path from a node $u$ 
to a node $v$ in $G$ then $L_{out}(u) \cap L_{in}(v) \neq \emptyset$. □ 

We define the size of the 2-hop cover to be the sum of the sizes of all node 
labels: $\sum_{v \in V} (|L_{in}(v)| + |L_{out}(v)|)$. 
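
The definitions above translate directly into small checks. The sketch below tests whether a labeling is valid in the sense of Definition 1, whether it covers a given set of connections in the sense of Definition 2, and computes its size. Function names and data are hypothetical, and the sketch assumes that every node implicitly belongs to its own labels, so only explicitly stored entries count towards the size.

```python
def is_valid_labeling(closure, l_in, l_out):
    """Definition 1: every label entry must correspond to an existing path."""
    ok_in = all((x, v) in closure for v, xs in l_in.items() for x in xs)
    ok_out = all((v, y) in closure for v, ys in l_out.items() for y in ys)
    return ok_in and ok_out


def is_two_hop_cover(closure, l_in, l_out):
    """Definition 2: every connection (u, v) must have a common center node.

    Assumption: each node is implicitly a member of its own labels.
    """
    for u, v in closure:
        out_u = l_out.get(u, set()) | {u}
        in_v = l_in.get(v, set()) | {v}
        if not (out_u & in_v):
            return False
    return True


def cover_size(l_in, l_out):
    """Sum of the sizes of all explicitly stored node labels."""
    return sum(len(s) for s in l_in.values()) + sum(len(s) for s in l_out.values())


# Hypothetical graph 1 -> 2 -> 3, 3 -> 4, 3 -> 5 and its transitive closure.
closure = {(1, 2), (2, 3), (3, 4), (3, 5), (1, 3), (2, 4), (2, 5), (1, 4), (1, 5)}
l_out = {1: {2, 3}, 2: {3}}
l_in = {4: {3}, 5: {3}}
```

For this example the cover stores 5 entries while the transitive closure contains 9 connections; dropping node 2 from the out-label of node 1 leaves the connection (1, 2) uncovered.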



3.2 Computation of a 2-Hop Cover 

To represent the transitive closure of a graph, we are, of course, interested in a 
2-hop cover with minimal size. However, as the minimum set cover problem can 
be reduced to the problem of finding a minimum 2-hop cover for a graph, we are 
facing an NP-hard problem [11,9]. So we need an approximation algorithm for 
large graphs. Cohen et al. introduce a polynomial-time algorithm that computes 
a 2-hop cover for a graph $G = (V, E)$ whose size is larger than the optimal size 
by at most a factor of $O(\log |V|)$. We now sketch this algorithm. 

Let $G = (V, E)$ be a directed graph and $G' = (V, T)$ be the transitive closure 
of $G$. For a node $w \in V$, $C_{in}(w) = \{v \in V \mid (v, w) \in T\}$ is the set of nodes $v \in V$ 
for which there is a path from $v$ to $w$ in $G$ (i.e., the ancestors of $w$). Analogously, 
for a node $w \in V$, $C_{out}(w) = \{v \in V \mid (w, v) \in T\}$ is the set of nodes $v \in V$ for 
which there is a path from $w$ to $v$ in $G$ (i.e., the descendants of $w$). 

For a node $w \in V$ let 
$$S(C_{in}(w), w, C_{out}(w)) = \{(u,v) \in T \mid u \in C_{in}(w) \text{ and } v \in C_{out}(w)\} = \{(u,v) \in T \mid (u,w) \in T \text{ and } (w,v) \in T\}$$ 
denote the set of paths in $G$ that contain $w$. The node $w$ is called center of the set 
$S(C_{in}(w), w, C_{out}(w))$. 

For a given 2-hop labeling that is not yet a 2-hop cover let $T' \subseteq T$ be the set 
of connections that are not yet covered. Thus, the set $S(C_{in}(w), w, C_{out}(w)) \cap T'$ 
contains all connections of $G$ that contain $w$ and are not covered. The ratio 

$$r(w) = \frac{|S(C_{in}(w), w, C_{out}(w)) \cap T'|}{|C_{in}(w)| + |C_{out}(w)|}$$ 

describes the relation between the number of connections via $w$ that are not yet 
covered and the total number of nodes that lie on such connections. 

The algorithm for computing a nearly optimal 2-hop cover starts with $T' = T$ 
and empty 2-hop labels for each node of $G$. The set $T'$ contains, at each stage, 
the set of connections that are not yet covered. In a greedy manner the algorithm 
chooses the “best” node $w \in V$ that covers as many not yet covered connections 
as possible using a small number of nodes. If we choose $w$ with the highest value 
of $r(w)$, we arrive at a small set of nodes that covers many of the not yet covered 
connections but does not increase the size of the 2-hop labeling too much. After 
$w$, $C_{in}(w)$, and $C_{out}(w)$ are selected, their nodes are used to update the 2-hop labels: 

for all $v \in C_{in}(w)$: $L_{out}(v) := L_{out}(v) \cup \{w\}$ 
for all $v \in C_{out}(w)$: $L_{in}(v) := L_{in}(v) \cup \{w\}$ 

and then $S(C_{in}(w), w, C_{out}(w))$ will be removed from $T'$. The algorithm terminates 
when the set $T'$ is empty, i.e., when all connections in $T$ are covered by 
the resulting 2-hop cover. 
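
The greedy loop can be sketched as follows. This is a deliberately simplified variant, not the algorithm of Cohen et al.: for each candidate center it takes all nodes that lie on still uncovered connections through that center, rather than searching for the best subsets, so it yields a valid 2-hop cover but carries no approximation guarantee. The function name is hypothetical, and every node is assumed to belong implicitly to its own labels.

```python
def greedy_two_hop_cover(closure):
    """Simplified greedy construction of a 2-hop cover from a transitive closure.

    closure: set of (u, v) connections with integer node ids.
    Returns (l_in, l_out), the explicitly stored labels per node.
    """
    nodes = {n for pair in closure for n in pair}
    # Ancestors and descendants, each including the node itself.
    anc = {n: {n} for n in nodes}
    desc = {n: {n} for n in nodes}
    for u, v in closure:
        desc[u].add(v)
        anc[v].add(u)

    l_in = {n: set() for n in nodes}
    l_out = {n: set() for n in nodes}
    # Pairs (u, u) are covered by the implicit self-membership.
    uncovered = {(u, v) for (u, v) in closure if u != v}

    while uncovered:
        best = None
        for w in sorted(nodes):
            via_w = {(u, v) for (u, v) in uncovered if u in anc[w] and v in desc[w]}
            if not via_w:
                continue
            c_in = {u for u, _ in via_w}
            c_out = {v for _, v in via_w}
            ratio = len(via_w) / (len(c_in) + len(c_out))
            if best is None or ratio > best[0]:
                best = (ratio, w, c_in, c_out)
        _, w, c_in, c_out = best
        # Label update step; w itself needs no explicit entry.
        for u in c_in - {w}:
            l_out[u].add(w)
        for v in c_out - {w}:
            l_in[v].add(w)
        uncovered -= {(u, v) for (u, v) in uncovered if u in c_in and v in c_out}
    return l_in, l_out


# Hypothetical closure of the graph 1 -> 2 -> 3, 3 -> 4, 3 -> 5.
closure = {(1, 2), (2, 3), (3, 4), (3, 5), (1, 3), (2, 4), (2, 5), (1, 4), (1, 5)}
l_in, l_out = greedy_two_hop_cover(closure)
```

Each round covers at least one connection, because taking the source of any uncovered connection as the center always yields a non-empty candidate set, so the loop terminates.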

For a node $w \in V$ there are an exponential number of subsets 
$C_{in}(w), C_{out}(w) \subseteq V$ which must be considered in a single computation step. So, 
the above algorithm would require exponential time for computing a 2-hop cover 
for a given set $T$, and thus needs further considerations to achieve polynomial 
run-time. 

The problem of finding the sets $C_{in}(w), C_{out}(w) \subseteq V$ for a given node $w \in V$ 
that maximize the quotient $r$ is exactly the problem of finding the densest 
subgraph of the center graph of $w$. We construct an auxiliary undirected bipartite 
center graph $CG_w = (V_w, E_w)$ of node $w$ as follows. The set $V_w$ contains two 
nodes $v_{in}$ and $v_{out}$ for each node $v \in V$ of the original graph. There is an 
undirected edge $(u_{out}, v_{in}) \in E_w$ if and only if $(u, v) \in T'$ is still not covered and 
$u \in C_{in}(w)$ and $v \in C_{out}(w)$. Finally, all isolated nodes can be removed from 
$CG_w$. 

Figure 2 shows the center graph of node w = 6 for the graph given in Figure 1. 

Definition 3 (Center Graph). Let $G = (V, E)$ be a directed graph. For a given 
2-hop labeling let $T' \subseteq T$ be the set of not yet covered connections in $G$, and let 
$w \in V$. The center graph $CG_w = (V_w, E_w)$ of $w$ is an undirected, bipartite graph 
with node set $V_w$ and edge set $E_w$. The set of nodes is $V_w = V_{in} \cup V_{out}$ where 
$V_{in} = \{u_{in} \mid u \in V : \exists v \in V : (u,v) \in T' \text{ and } u \in C_{in}(w) \text{ and } v \in C_{out}(w)\}$ and 
$V_{out} = \{v_{out} \mid v \in V : \exists u \in V : (u,v) \in T' \text{ and } u \in C_{in}(w) \text{ and } v \in C_{out}(w)\}$. 
There is an undirected edge $(u_{in}, v_{out}) \in E_w$ if and only if $(u,v) \in T'$ and 
$u \in C_{in}(w)$ and $v \in C_{out}(w)$. 

The density of a subgraph is the average degree (i.e., number of incoming and 
outgoing edges) of its nodes. The densest subgraph of a given center graph $CG_w$ 
can be computed by a linear-time 2-approximation algorithm which iteratively 
removes a node of minimum degree from the graph. This generates a sequence of 
subgraphs and their densities. The algorithm returns the subgraph with the 
highest density, i.e., the densest subgraph $CG'_w = (V'_w, E'_w)$ of the given 
center graph $CG_w$ where density is the ratio of the number of edges to the 
number of nodes in the subgraph. We denote the density of this subgraph by $d_w$. 

Definition 4 (Densest Subgraph). Let $CG = (V, E)$ be an undirected graph.
The densest subgraph problem is to find a subset $V' \subseteq V$ such that the
average degree $d$ of nodes of the subgraph $CG' = (V', E')$ is maximized, where
$d = |E'| / |V'|$. Here, $E'$ is the set of edges of $E$ that connect two nodes
of $V'$.
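
The peeling procedure described above can be sketched as follows. This is a
minimal illustration that uses the density $|E'| / |V'|$ of Definition 4; for
brevity it searches for the minimum-degree node with a linear scan, so it is
quadratic rather than linear-time, and the function name is an assumption.

```python
def densest_subgraph(nodes, edges):
    """Greedy 2-approximation: repeatedly remove a node of minimum degree
    and remember the intermediate subgraph with the highest density."""
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    remaining = set(adj)
    num_edges = sum(len(s) for s in adj.values()) // 2
    best_density, best_nodes = 0.0, set()
    while remaining:
        density = num_edges / len(remaining)
        if density > best_density:
            best_density, best_nodes = density, set(remaining)
        # Linear scan for a minimum-degree node (ties broken by name).
        victim = min(remaining, key=lambda n: (len(adj[n]), str(n)))
        for neighbour in adj[victim]:
            adj[neighbour].discard(victim)
        num_edges -= len(adj[victim])
        del adj[victim]
        remaining.remove(victim)
    return best_density, best_nodes
```

Applied to a center graph, the nodes would be the elements of $V_{in}$ and
$V_{out}$ (kept distinct, e.g. by tagging them), and the returned node set
yields $V'_{in}$ and $V'_{out}$.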



Fig. 2. Center graph of node w = 6 (labeled “paper”)

The refined algorithm for computing a 2-hop cover chooses the “best” node $w$
out of the remaining nodes in descending order of the density of the densest
subgraph $CG'_w = (V'_{in} \cup V'_{out}, E'_w)$ of the center graph
$CG_w = (V_{in} \cup V_{out}, E_w)$ of $w$. Thus, we efficiently obtain the sets
$C_{in}(w) = V'_{in}$, $C_{out}(w) = V'_{out}$ for a given node $w$ with maximum
quotient $r(w)$.

So this consideration yields a polynomial-time algorithm for computing a 
2-hop cover for the set T of connections of the given graph G. 

Constructing the 2-hop cover has time complexity $O(|V|^3)$, because computing
the transitive closure of the given graph $G$ using the Floyd-Warshall
algorithm [11] needs time $O(|V|^3)$, and computing the 2-hop cover from the
transitive closure also needs time $O(|V|^3)$. (The first step computes the
densest subgraphs for $|V|$ nodes, the second step computes the densest
subgraphs for $|V| - 1$ nodes, etc., yielding $O(|V|^2)$ computations, each
with worst-case complexity $O(|V|)$.)

The 2-hop cover requires at most space $O(|V| \cdot \sqrt{|E|})$, yielding
$O(|V|^2)$ in the worst case. However, it can be shown that for undirected
trees the worst-case space complexity is $O(n \log n)$; Cohen et al. state
in [9] that the complexity tends to remain that favorable for graphs that are
very tree-similar (i.e., that can be transformed into trees by removing a small
number of edges), which would be the case for XML documents with few links.
Testing the connectivity of two nodes, using the 2-hop cover, requires time
$O(L)$ on average, where $L$ is the average size of the label sets of nodes.
Experiments show that this number is very small for most nodes in our XML
application (see Section 6).
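
The connectivity test amounts to intersecting two label sets. A minimal sketch,
assuming the labels are held in two dictionaries `l_in` and `l_out` (names
chosen here for illustration) and that every node implicitly belongs to its own
labels:

```python
def connected(u, v, l_in, l_out):
    """u reaches v iff L_out(u) and L_in(v) share a center node.
    Each node is treated as an implicit member of its own label sets."""
    out_u = l_out.get(u, set()) | {u}
    in_v = l_in.get(v, set()) | {v}
    return not out_u.isdisjoint(in_v)
```

With hashed sets the test touches each label entry at most once, which matches
the $O(L)$ behaviour stated above.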

4 Efficient and Scalable Construction of the HOPI Index 

The algorithm by Cohen et al. for computing the 2-hop cover is very elegant
from a theoretical viewpoint, but it has problems when applied to large graphs
such as large-scale XML collections:

— Exhaustively computing the densest subgraph for all center graphs in each 
step of the algorithm is very time-consuming and thus prohibitive for large 
graphs. 

— Operating on the precomputed transitive closure as an input parameter is 
very space-consuming and thus a potential problem for index creation on 
large graphs. 

Although both problems arise only during index construction (and are no
longer issues for index lookups once the index has been built), they are critical
in practice because many applications require online index creation in parallel
to the regular workload, so that the processing power and especially the memory
that is available to the index builder may be fairly limited. In this section we
show how to overcome these problems and present the scalable HOPI index
construction method. In Subsection 4.1 we develop results that can dramatically
reduce the number of densest-subgraph computations. In Subsection 4.2 we develop
a divide-and-conquer method that can drastically alleviate the space-consumption
problem of initially materializing the transitive closure and also speeds up the
actual 2-hop-cover computation.

Fig. 3. Densest subgraph of a given center graph (densities of the successive
subgraphs: d = 8/7 = 1.14, d = 7/6 = 1.17, d = 6/5 = 1.2, d = 4/4 = 1.0, ...,
d = 0.67, ..., d = 0.5, ..., d = 0)



4.1 Efficient Computation of Densest Subgraphs 

A naive implementation of the polynomial-time algorithm of Cohen et al. would
recompute the densest subgraph of all center graphs in each step of the algorithm,
yielding $O(|V|^2)$ such computations in the worst case. However, as in each step
only a small fragment of all connections is removed, only a few center graphs
change; so it is unnecessary to recompute the densest subgraphs of unchanged
center graphs. Additionally, it is easy to see that the density of the densest
subgraph of a center graph will not increase if we remove some connections.

We therefore propose to precompute the density $d_w$ of the densest subgraph of
the center graph of each node $w$ of the graph $G$ at the beginning of the
algorithm. We insert each node $w$ in a priority queue with $d_w$ as priority.
In each step of the algorithm, we then extract the node $m$ with the current
maximum density from the queue and check if the stored density is still valid
(by recomputing $d_m$ for this node). If they are different, i.e., the extracted
value is larger than $d_m$, another node $w$ may have a larger $d_w$; so we
reinsert $m$ with its newly computed $d_m$ as priority into the queue and
extract the current maximum. We repeat this procedure until we find a node where
the stored density equals the current density. Even though this modification
does not change the worst-case complexity, our experiments show that we have to
recompute $d_w$ for each node $w$ only about 2 to 3 times on average, as opposed
to $O(|V|)$ computations for each node in the original algorithm. Cohen et al.
also discuss a similar approach to maintaining precomputed densest subgraphs in
a heap, but their technique requires more space as they keep all center graphs
in memory.
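
The lazy re-evaluation can be sketched with a binary heap. The sketch relies on
the observation above that densities never increase; `current_density` is a
hypothetical callback that recomputes $d_w$ for one node, and the heap layout is
an assumption made for this illustration.

```python
import heapq

def pick_best_center(heap, current_density):
    """heap holds (-stored_density, node) pairs, so the largest stored density
    is popped first. A node is accepted only if its stored density is still
    valid; otherwise it is reinserted with the freshly computed value."""
    while heap:
        neg_stored, node = heapq.heappop(heap)
        fresh = current_density(node)
        if fresh >= -neg_stored:      # stored value still valid
            return node, fresh
        heapq.heappush(heap, (-fresh, node))
    return None
```

After the returned node has been used as a center, the caller would reinsert it
with its updated density if it still has uncovered connections.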

In addition, there is even more potential for optimization. In our experiments, 
it turned out that precomputing the densest subgraphs took significant time 
for large graphs. This precomputation step can be dramatically accelerated by 
exploiting additional properties of center graphs that we will now derive. 

We say that a center graph is complete if there are edges between each node
$u \in V_{in}$ and each node $v \in V_{out}$. We can then show the following
lemma:

Lemma 1. Let $G = (V, E)$ be a directed graph and $T'$ a set of connections that
are not yet covered. A complete subgraph $CG'_w$ of the center graph $CG_w$ of a
node $w \in V$ is always its densest subgraph. □



Proof. For a complete subgraph $CG'_w$, $|V'_{in}| \cdot |V'_{out}| = |E'_w|$
holds. A simple computation shows that the density
$d = (|V'_{in}| \cdot |V'_{out}|) / (|V'_{in}| + |V'_{out}|)$ of this graph is
maximal. □

Using this lemma, we can show that the initial center graphs are always their
densest subgraph. Thus we do not have to run the algorithm to find densest
subgraphs but can immediately use the density of the center graphs.
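
The "simple computation" in the proof of Lemma 1 can be spelled out as follows.
This is a sketch of one way to carry it out; the shorthands $a$ and $b$ for the
numbers of in- and out-nodes are introduced here and do not appear in the
original.

```latex
% a, b: numbers of in- and out-nodes of a complete bipartite graph (shorthand)
\[
  d(a,b) \;=\; \frac{a\,b}{a+b},
  \qquad
  \frac{\partial d}{\partial a} \;=\; \frac{b^{2}}{(a+b)^{2}} \;>\; 0,
  \qquad
  \frac{\partial d}{\partial b} \;=\; \frac{a^{2}}{(a+b)^{2}} \;>\; 0 .
\]
% Any subgraph on a' <= a in-nodes and b' <= b out-nodes with edge set E''
% has at most a'b' edges, so its density cannot exceed that of the whole graph:
\[
  \frac{|E''|}{a'+b'} \;\le\; \frac{a'\,b'}{a'+b'} \;\le\; \frac{a\,b}{a+b} .
\]
```

For example, a complete center graph with $a = 2$ and $b = 3$ has density
$6/5 = 1.2$, and no subgraph of it is denser.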

Lemma 2. Let $G = (V, E)$ be a directed graph and $T' = T$ the set of
connections that are not yet covered. The center graph $CG_w$ of a node
$w \in V$ is itself its densest subgraph. □

Proof. We show that the center graph is always complete, so that the claim
follows from the previous lemma. Let $T$ be the set of all connections of a
directed graph $G$. We assume there is a node $w$ such that the corresponding
center graph is not complete. Thus, the following three conditions hold:

1. there are two nodes $u \in V_{in}, v \in V_{out}$ such that $(u,v) \notin E_w$

2. there is at least one node $x \in V_{out}$ such that $(u,x) \in E_w$

3. there is at least one node $y \in V_{in}$ such that $(y,v) \in E_w$

As described in Definition 3, the second and third conditions imply that
$(u,w), (w,x) \in T$ and $(y,w), (w,v) \in T$. But if $(u,w) \in T$ and
$(w,v) \in T$ then $(u,v) \in E_w$. This contradicts our first condition.
Therefore, the initial center graph of any node $w$ is complete. □

Initially, the density of the densest subgraph of the center graph for a node
$w$ can be computed as $d_w = |E_w| / |V_w|$. Although our little lemma applies
only to the initial center graphs, it does provide significant savings in the
precomputation: our experiments have shown that the densest subgraphs of 100,000
nodes can be computed in less than one second.

4.2 Divide-and-Conquer Computation of the 2-Hop Cover

Since materializing the transitive closure as the input of the 2-hop-cover
computation can be very critical in terms of memory consumption, we propose a
divide-and-conquer technique based on partitioning the original XML graph so
that the transitive closure needs to be materialized only for each partition
separately. Our technique works in three steps:

1. Compute a partitioning of the original XML graph. Choose the size of each
partition (and thus the number of partitions) such that the 2-hop-cover
computation for each partition can be carried out with memory-based data
structures.

2. Compute the transitive closure and the 2-hop cover for each partition and
store the 2-hop cover on disk.

3. Merge the 2-hop covers for partitions that have one or more cross-partition
edges, yielding a 2-hop cover for the entire graph.

In addition to eliminating the bottleneck in transitive closure materialization,
the divide-and-conquer algorithm also makes very efficient use of the available
memory during the 2-hop-cover computation and scales up well, and it can even
be parallelized in a straightforward manner. We now explain how steps 1 and 3
of the algorithm are implemented in our prototype system; step 2 simply applies
the algorithm of Section 3 with the optimizations presented in the previous
subsection.

Graph Partitioning. The general partitioning problem for directed graphs
can be stated as follows: given a graph $G = (V,E)$, a node weight function
$f_V : V \to \mathbb{N}$, an edge weight function $f_E : E \to \mathbb{N}$ and a
maximal partition weight $M$, compute a set $P = \{V_1, \ldots, V_p\}$ of
partitions of $G$ such that $V = \bigcup_{i=1}^{p} V_i$, for each $V_i$:
$\sum_{v \in V_i} f_V(v) \le M$, and the cost

$$c := \sum_{e \in E \cap \bigcup_{i \ne j} V_i \times V_j} f_E(e)$$

of the partitioning is minimized. We call the set
$E_c := E \cap \bigcup_{i \ne j} V_i \times V_j$ the set of cross-partition
edges.

This partitioning problem is known to be NP-hard, so the optimal partition- 
ing for a large graph cannot be efficiently computed. However, the literature 
offers many good approximation algorithms. In our prototype system, we
implemented a greedy partitioning heuristic based on [13] and [7]. This algorithm
builds one partition at a time by selecting a seed node and greedily accumu- 
lating nodes by traversing the graph (ignoring edge direction) while trying to 
keep E c as small as possible. This process is repeated until the partition has 
reached a predefined maximum size (e.g., the size of the available memory). We 
considered several approaches for selecting seeds, but none of them consistently 
won. Therefore, seeds are selected randomly from the nodes that have not yet 
been assigned to a partition, and the partitioning is recomputed several times, 
finally choosing the partitioning with minimal cost as the result. 
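
A greedy heuristic of this kind could look like the sketch below. It is not the
algorithm of [13] or [7]; the candidate selection rule (prefer the neighbour
with the most links into the current partition), the function names, and the
data layout (edge weights in a dictionary keyed by node pairs) are assumptions
made for illustration.

```python
import random

def greedy_partition(nodes, edges, weight, max_weight, rng):
    """Grow one partition at a time from a random seed node, ignoring edge
    direction and never exceeding max_weight (a seed is always accepted)."""
    neighbours = {n: set() for n in nodes}
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    unassigned = set(neighbours)
    partitions = []
    while unassigned:
        start = rng.choice(sorted(unassigned))
        unassigned.remove(start)
        part, total = {start}, weight[start]
        frontier = neighbours[start] & unassigned
        while frontier:
            # Prefer the candidate with the most links into the partition,
            # which tends to keep the set of cross-partition edges small.
            cand = max(sorted(frontier), key=lambda n: len(neighbours[n] & part))
            frontier.remove(cand)
            if total + weight[cand] > max_weight:
                continue
            part.add(cand)
            total += weight[cand]
            unassigned.remove(cand)
            frontier |= neighbours[cand] & unassigned
        partitions.append(part)
    return partitions

def cut_cost(partitions, edges):
    """Sum of the weights of all cross-partition edges."""
    owner = {n: i for i, part in enumerate(partitions) for n in part}
    return sum(w for (u, v), w in edges.items() if owner[u] != owner[v])

def best_partitioning(nodes, edges, weight, max_weight, tries=10, seed=0):
    """Repeat the randomized heuristic and keep the cheapest partitioning."""
    rng = random.Random(seed)
    runs = [greedy_partition(nodes, edges, weight, max_weight, rng)
            for _ in range(tries)]
    return min(runs, key=lambda p: cut_cost(p, edges))
```

Because the seeds are random, different runs can produce different
partitionings; only the weight limit and the full coverage of all nodes are
guaranteed.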

In principle, we could invoke this partitioning algorithm on the XML element 
graph with all node and edge weights uniformly set to 1. However, the size 
of this graph may still pose efficiency problems. Moreover, we can exploit the 
fact that we consider XML data where most of the edges can be expected to 
be intra-document parent-child edges. So we actually consider only the much 
more compact document graph (introduced in Subsection 1.2) in the partitioning 
algorithm. The node weight of a document is the number of its elements, and the
weight of an edge is the number of links from elements of the edge-source
document to elements of the edge-target document. This choice of weights is
obviously
heuristic, but our experiments show that it leads to fairly good performance. 

Cover Merging. After the 2-hop covers for the partitions have been computed,
the cover for the entire graph is built by forming the union of the partitions’
covers and adding information about connections induced by cross-partition
edges. A cross-partition edge $x \to y$ may establish new connections from the
ancestors of $x$ to the descendants of $y$ if $x$ and $y$ have not been known to
be connected before. To reflect this new connection in the 2-hop cover for the
entire graph, we choose $x$ as a center node and update the labels of other
nodes as follows:

    for all $a \in ancestors(x)$ : $L_{out}(a) := L_{out}(a) \cup \{x\}$

    for all $d \in descendants(y) \cup \{y\}$ : $L_{in}(d) := L_{in}(d) \cup \{x\}$

As x may not be the optimal choice for the center node, the resulting index 
may be larger than necessary, but it correctly reflects all connections. 
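
The label update for a single cross-partition edge can be sketched as follows.
The sketch assumes in-memory label dictionaries and derives ancestors and
descendants from the labels accumulated so far; all function names are
illustrative, and nodes are treated as implicit members of their own labels.

```python
def connected(u, v, l_in, l_out):
    """2-hop test with each node implicitly contained in its own labels."""
    return not (l_out.get(u, set()) | {u}).isdisjoint(l_in.get(v, set()) | {v})

def ancestors(x, l_in, l_out):
    """All nodes that reach x according to the current labels."""
    centers = l_in.get(x, set()) | {x}
    found = set(centers)
    found.update(a for a, labels in l_out.items() if labels & centers)
    found.discard(x)
    return found

def descendants(y, l_in, l_out):
    """All nodes reachable from y according to the current labels."""
    centers = l_out.get(y, set()) | {y}
    found = set(centers)
    found.update(d for d, labels in l_in.items() if labels & centers)
    found.discard(y)
    return found

def merge_cross_edge(x, y, l_in, l_out):
    """Reflect the cross-partition edge x -> y by using x as center node."""
    if connected(x, y, l_in, l_out):
        return                          # nothing new to record
    anc = ancestors(x, l_in, l_out)
    desc = descendants(y, l_in, l_out) | {y}
    for a in anc:
        l_out.setdefault(a, set()).add(x)
    for d in desc:
        l_in.setdefault(d, set()).add(x)
```

For two partitions with the chains 1 -> 2 and 3 -> 4, merging the cross edge
2 -> 3 adds node 2 to the $L_{in}$ labels of nodes 3 and 4, after which node 1
is found to be connected to node 4.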

5 Implementation Details 

As we aim at very large, dynamic XML collections, we implemented HOPI as a
database-backed index structure, by storing the 2-hop cover in database tables
and running SQL queries against these tables to evaluate XPath-like queries. Our
implementation is based on Oracle 9i, but could be easily carried over to other
database platforms. Note that this approach automatically provides us with
all the dependability and manageability benefits of modern database systems,
particularly, recovery and concurrency control. For storing the 2-hop cover, we
need two tables LIN and LOUT that capture $L_{in}$ and $L_{out}$:

    CREATE TABLE LIN(
        ID   NUMBER(10),
        INID NUMBER(10));

    CREATE TABLE LOUT(
        ID    NUMBER(10),
        OUTID NUMBER(10));

Here, ID stores the ID of the node and INID/OUTID store the node’s label, with
one entry in LIN/LOUT for each entry in the node’s corresponding
$L_{in}$/$L_{out}$ sets. To minimize the number of entries, we do not store the
node itself as INID or OUTID values. For efficient evaluation of queries,
additional database indexes are built on both tables: a forward index on the
concatenation of ID and INID for LIN and on the concatenation of ID and OUTID
for LOUT, and a backward index on the concatenation of INID and ID for LIN and
on the concatenation of OUTID and ID for LOUT. In our implementation, we store
both LIN and LOUT as index-organized tables in Oracle sorted in the order of the
forward index, so the additional backward indexes double the disk space needed
for storing the tables.

Additionally we maintain information about nodes in $G_X$ in the table NODES
that stores for each node its unique ID, its XML tag name, and the URL of its
document.



Connection Test. To test if two nodes identified by their ID values ID1 and
ID2 are connected, the following SQL statement would be used if we stored the
complete node labels (i.e., did not omit the nodes themselves from the stored
$L_{in}$ and $L_{out}$ labels):

    SELECT COUNT(*) FROM LIN, LOUT WHERE LOUT.ID=ID1 AND LIN.ID=ID2
        AND LOUT.OUTID=LIN.INID

This query performs the intersection of the $L_{out}$ set of the first node
with the $L_{in}$ set of the second node. Whenever the query returns a
non-zero value, the nodes are connected. It is evident that the backward
indexes are helpful for an efficient evaluation of this query. As we do not
store the node itself in its label, the system executes the following two
additional, very efficient, queries that capture this case:

    SELECT COUNT(*) FROM LOUT WHERE LOUT.ID=ID1 AND LOUT.OUTID=ID2
    SELECT COUNT(*) FROM LIN WHERE LIN.ID=ID2 AND LIN.INID=ID1

Again it is evident that the backward and the forward index speed up query
execution. For ease of presentation, we will not mention these additional queries
in the remainder of this section.
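
The three queries can be tried out with the following sketch. It uses SQLite
through Python's `sqlite3` module as a stand-in for Oracle, with INTEGER columns
instead of NUMBER(10); the index names and the helper functions are assumptions
made for this illustration.

```python
import sqlite3

def make_db():
    """Create empty LIN/LOUT tables with forward and backward indexes."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE LIN (ID INTEGER, INID INTEGER);
        CREATE TABLE LOUT(ID INTEGER, OUTID INTEGER);
        CREATE INDEX LIN_FWD  ON LIN(ID, INID);
        CREATE INDEX LIN_BWD  ON LIN(INID, ID);
        CREATE INDEX LOUT_FWD ON LOUT(ID, OUTID);
        CREATE INDEX LOUT_BWD ON LOUT(OUTID, ID);
    """)
    return db

def is_connected(db, id1, id2):
    """True if any of the three connection queries returns a non-zero count."""
    queries = [
        # common center node in L_out(id1) and L_in(id2)
        ("SELECT COUNT(*) FROM LIN, LOUT WHERE LOUT.ID=? AND LIN.ID=? "
         "AND LOUT.OUTID=LIN.INID", (id1, id2)),
        # id2 itself is a center in L_out(id1)
        ("SELECT COUNT(*) FROM LOUT WHERE ID=? AND OUTID=?", (id1, id2)),
        # id1 itself is a center in L_in(id2)
        ("SELECT COUNT(*) FROM LIN WHERE ID=? AND INID=?", (id2, id1)),
    ]
    return any(db.execute(sql, args).fetchone()[0] > 0 for sql, args in queries)
```

A caller would fill LIN and LOUT from the computed cover and then call
`is_connected(db, id1, id2)` for each pair of interest.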



Compute Descendants. To compute all descendants of a given node with ID
ID1, the following SQL query is submitted to the database:

    SELECT LIN.ID FROM LIN, LOUT WHERE LOUT.ID=ID1
        AND LOUT.OUTID=LIN.INID

It returns the IDs of the descendants of the given node. Using the forward index
on LOUT and the backward index on LIN, this query can be efficiently evaluated.



Descendants with a Given Tag Name. As the last case in this subsection,
we consider how to determine the descendants of a given node with ID ID1 that
have a given tag name N. The following SQL query solves this case:

    SELECT LIN.ID FROM LIN, LOUT, NODES WHERE LOUT.ID=ID1
        AND LOUT.OUTID=LIN.INID AND LIN.ID=NODES.ID AND NODES.NAME=N

Again, the query can be answered very efficiently with an additional index on
the NAME column of the NODES table.
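
The descendant query with a tag-name filter can be sketched in the same way.
The example again uses SQLite via `sqlite3` instead of Oracle and, unlike the
simplified query above, also unions in the two cases that arise because a node
is not stored in its own label; the NODES column names and the helper names are
assumptions.

```python
import sqlite3

DESCENDANTS_BY_NAME = """
SELECT N.ID FROM NODES N
WHERE N.NAME = :name AND N.ID IN (
    SELECT LIN.ID FROM LIN, LOUT
        WHERE LOUT.ID = :id AND LOUT.OUTID = LIN.INID  -- via a common center
    UNION SELECT OUTID FROM LOUT WHERE ID = :id        -- centers of the node
    UNION SELECT ID FROM LIN WHERE INID = :id)         -- node is the center
ORDER BY N.ID
"""

def make_db():
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE LIN  (ID INTEGER, INID INTEGER);
        CREATE TABLE LOUT (ID INTEGER, OUTID INTEGER);
        CREATE TABLE NODES(ID INTEGER PRIMARY KEY, NAME TEXT, URL TEXT);
        CREATE INDEX NODES_NAME ON NODES(NAME);
    """)
    return db

def descendants_with_name(db, node_id, name):
    """IDs of all descendants of node_id that carry the given tag name."""
    rows = db.execute(DESCENDANTS_BY_NAME, {"id": node_id, "name": name})
    return [row[0] for row in rows]
```

A call such as `descendants_with_name(db, 1, "author")` would correspond to the
path query //inproceedings//author evaluated for one source element.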

6 Experimental Evaluation 

6.1 Setup 

In this section, we compare the storage requirements and the query performance 
of HOPI with other, existing path index approaches, namely 

— the pre- and postorder encoding scheme [15,16] for tree-structured XML 
data, 

— a variant of APEX [6] without optimization for frequently used queries 
(APEX-0) that was adapted to our model for the XML graph, 

— using the transitive closure as a connection index. 

We implemented all strategies as indexes of our XML search engine XXL [24,25].
However, to exclude any possible influences of the XXL system on the
measurements, we measured the performance independently from XXL by directly
calling the index implementations. As we want to support large-scale data that
do not fit into main memory, we implemented all strategies as database
applications, i.e., they read all information from database tables without
explicit caching (other than the usual caching in the database engine).

All our experiments were run on a Windows-based PC with a 3GHz Pentium IV
processor and 4 GByte RAM. We used an Oracle 9.2 database server that ran on a
second Windows-based PC with a 3GHz Pentium IV, 1GB of RAM, and a single IDE
hard disk.



6.2 Results with Real-Life Data 

Index Size. As a real-life example for XML data with links we used the XML 
version of the DBLP collection [20]. We generated one XML doc for each 2nd- 
level element in DBLP (article, inproceedings, ...) plus one document for 
the top-level dblp document and added XLinks that correspond to cite and 
crossref entries. The resulting document collection consists of 419,334 docu- 
ments with 5,244,872 elements and 63,215 links (plus the 419,333 links from the 
top-level document to the other documents). To see how large HOPI gets for 
real-life data, we built the index for two fragments of DBLP: 

— The fragment consisting of all publications in EDBT, ICDE, SIGMOD and 
VLDB. It consists of 5,561 documents with a total of 141,140 nodes and 9,105
links. The transitive closure for this data has 5,651,952 connections that 
require about 43 Megabytes of storage (2x4 bytes for each entry, without 
distance information). HOPI built without partitioning the document graph 
resulted in a cover of size 231,596 entries requiring about 3.5 Megabytes of 
storage (2x4 bytes for each entry plus the same amount for the backward 
index entry); so HOPI is about 12 times more compact than the transitive 
closure. Partitioning the graph into three partitions and then merging the 
computed covers yielded a cover of size 251,315 entries which is still about 11 
times smaller than the transitive closure. Computing this cover took about 
16 minutes. 

— The complete DBLP set. The transitive closure for the complete DBLP set 
has 306,637,532 entries requiring about 2.4 Gigabytes of storage. By
partitioning the document graph into 53 partitions of size 100,000 elements, we
arrived at an overall cover size of 27,190,122 entries that require about 415
Megabytes of storage; this is a compression factor of about 5.8. Computing
this cover took about 24 hours without any parallelization. Only a minor share
of the time was spent on computing the partition covers; merging the covers
consumed most of the time because of the many SQL statements executed against
the PC-based low-end database server used in our experiments (where especially
the slow IDE disk became the main bottleneck).
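
The storage figures follow directly from the entry counts and the stated entry
sizes. The short calculation below reproduces them under the assumption that a
Megabyte is counted as $2^{20}$ bytes; the helper name is illustrative.

```python
def storage_mib(entries, bytes_per_entry):
    """Storage in units of 2**20 bytes for fixed-size entries."""
    return entries * bytes_per_entry / 2**20

# Transitive closure of the small fragment: 2 x 4 bytes per connection.
closure_small = storage_mib(5_651_952, 2 * 4)
# HOPI cover: 2 x 4 bytes per entry plus the same for the backward index.
hopi_small = storage_mib(231_596, 2 * 4 * 2)
hopi_full = storage_mib(27_190_122, 2 * 4 * 2)
# Compactness of HOPI relative to the closure for the small fragment.
ratio_small = closure_small / hopi_small
```

Rounded, these values agree with the sizes of about 43, 3.5, and 415 Megabytes
and the factor of about 12 reported above.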

Storage needed for the pre- and postorder labels for the tree part of the
data (i.e., disregarding links which are not supported by this approach) was
2x4 bytes per node, yielding about 1 Megabyte for the small set and about 40
Megabytes for complete DBLP. For APEX-0, space was dominated by the space
needed for storing the edge extents, which required in our implementation storing
4 additional bytes per node (denoting the node of the APEX graph in which this
node resides) and 2x4 bytes for each edge of the XML graph (note that HOPI
does not need this information), yielding an overall size of about 1.7 megabytes
for the small set and about 60.5 megabytes for complete DBLP.



Query Performance. For studying query performance, we concentrated on 
comparing HOPI against the APEX-0 path index, one of the very best index 
structures that supports parent-child and ancestor-descendant axes on arbitrary 
graphs. 

Figure 4 shows the wall-clock time to test whether two given elements are
connected, averaged over many randomly chosen inproceedings and author element
pairs, as a function of the distance between the elements. The figure shows
that HOPI performs one or two orders of magnitude better than APEX-0 and was
immune to increases in the distance.

Fig. 4. Time to test connection of two nodes at varying distances

For path queries with wildcards but without any additional conditions, such as
//inproceedings//author, HOPI outperformed APEX-0 only marginally. Note,
however, that such queries are rare in practice. Rather we would expect
additional filter predicates for the source and/or target elements; and with
conventional index lookups for these conditions the connection index would be
primarily used to test connectivity between two given elements as shown in
Figure 4 above.

Figure 5 shows the wall-clock time to compute all descendants of a given node,
averaged over randomly chosen nodes, as a function of the number of
descendants. (We first randomly selected source nodes, computed their
descendants, and later sorted the results by the number of descendants.) Again,
HOPI can beat APEX-0 by one or two orders of magnitude. But, of course, we
should recall that APEX has not been optimized for efficient descendant
lookups, but is primarily designed for parent-child navigation.

Fig. 5. Time to compute all descendants for a given node

Finally, for finding all descendants of a given node that have a given tag
name, HOPI was about 4 to 5 times faster than APEX-0 on average. This was
measured with randomly chosen inproceedings elements and finding all their
author descendants.

6.3 Scalability Results with Synthetic Data 

To systematically assess our index with respect to document size and frac- 
tion of links, we used the XMach benchmark suite [5] to generate a collection 
of synthetic documents. A document had about 165 elements on average. We
randomly added links between documents, where the numbers of incoming and
outgoing links for each document were chosen from a Zipf distribution
with skew parameter 1.05, choosing high numbers of outgoing links (“hubs”)
for documents with low ID and high numbers of incoming links (“authorities”)
for documents with high ID. For each link, the element from which it
starts was chosen uniformly among all elements within the source document,
and the link’s destination was chosen as the root element of the target
document, reflecting the fact that the majority of links in the Web point to the
root of documents.

Figure 6 shows the compression ratio that HOPI achieves compared to the
materialized transitive closure as the number of documents increases, with one
outgoing and one incoming link per document on average (but with the skewed
distribution discussed above). The dashed curve in the figure is the index
build time for HOPI. For the collection of 20,001 documents that consisted of
about 3.22 million elements, HOPI’s size was about 96 megabytes as compared to
about 37 megabytes for APEX-0.

Fig. 6. Compression factor of HOPI vs. transitive closure, with varying number
of documents (x-axis: number of documents)

Figure 7 shows the compression ratio and the index build time as the number of
links per document increases, for a fixed number, 1001, of documents. At an
average link density of five links per document, HOPI’s size was about 60
megabytes, whereas APEX-0 required about 4 megabytes. The compression ratio
ranges from about 5 to more than an order of magnitude.

Fig. 7. Compression factor of HOPI vs. transitive closure, with varying number
of links per document

These results demonstrate the dramatic space savings that HOPI can achieve as
compared to the transitive closure. As for index build time, HOPI scales up
nicely with an increasing number of documents when the number of links is kept
constant, whereas Figure 7 reflects the inevitable superlinear increase in the
cost as the graph density increases.

7 Conclusion 

Our goal in this work has been to develop a space- and time-efficient index
structure that supports XML path queries with wildcards such as /book//author,
regardless of whether the qualifying paths are completely within one document 
or span documents. We believe that HOPI has achieved this goal and signifi- 
cantly outperforms previously proposed XML index structures for this type of 
queries while being competitive for all other operations on XML indexes. Our ex- 
perimental results show that HOPI is an order of magnitude more space-efficient 
than an index based on materializing the transitive closure of the XML graph, 
and still significantly smaller than the APEX index. In terms of query perfor- 
mance, HOPI substantially outperforms APEX for path queries with wildcards 
and is competitive for child and parent axis navigation. 

The seminal work by Cohen et al. on the 2-hop cover concept provided ex- 
cellent algorithmic foundations to build on, but we had to address a number of 
important implementation issues that are decisive for a practically viable system 
solution that scales up with very large collections of XML data. Most impor- 
tantly, we developed new solutions to the issues of efficient index construction 
with limited memory. Our future work on this theme will include efficient algo- 
rithms for incremental updates and further improvements of index building by 
using more sophisticated algorithms for graph partitioning.

Abstract. Though skyline queries have already claimed their place in retrieval
over central databases, their application in Web information systems has so far
been impossible due to the distributed nature of retrieval over Web sources. But
given the amount, variety, and volatile nature of information accessible over
the Internet, extended query capabilities are crucial. We show how to efficiently
perform distributed skyline queries and thus essentially extend the
expressiveness of querying today's Web information systems. Together with
our innovative retrieval algorithm we also present useful heuristics to further
speed up the retrieval in most practical cases, paving the road towards meeting
even the real-time challenges of on-line information services. We discuss
performance evaluations and point to open problems in the concept and
application of skylining in modern information systems. For the curse of
dimensionality, an intrinsic problem in skyline queries, we propose a novel
sampling scheme that gives an early impression of the skyline for
subsequent query refinement.



1 Introduction 

In times of the ubiquitous Internet the paradigm of Web information systems has 
substantially altered the world of modern information acquisition. Both in business 
and private life the support with information that is stored in a decentralized manner 
and assembled at query time, is a resource that users more and more rely on. Consider 
for instance Web information services accessible via mobile devices. First useful 
services like city guides, route planning, or restaurant booking have been developed 
[5], [2] and generally all these services will heavily rely on information distributed 
over several Internet sources possibly provided by independent content providers. 
Frameworks like NTT DoCoMo’s i-mode [18] already provide a common platform 
and business model for a variety of independent content providers. 

Recent research on web-based information systems has focused on employing
middleware algorithms, where users had to specify weightings for each aspect of
their query and a central compensation function was used to find the best
matching objects [7], [1]. The lack of expressiveness of this ‘top k’ query
model, however, was first addressed by [8], and with the growing incorporation
of user preferences into database systems [6], [10] and information services
[22] the limitations of the entire model became more and more obvious. This led
towards the integration of so-called ‘skyline queries’ (e.g. [4]) into database
systems. Basically the ‘skyline’ is a non-discriminating combination of
numerical preferences under the notion of Pareto optimality. Since it was only
proposed for database systems working over a central (multi-dimensional) index
structure, extending its expressiveness also to the broad class of Web
information systems is most desirable. The contribution of this paper is to
undertake this task and present an efficient algorithm with proven optimality.
We will present a distributed skylining algorithm and show how to enhance its
efficiency for most practical cases by suitable heuristics. We will also give an
extensive performance evaluation and propose a scheme to cope with
high-dimensional skylines.

Efficient Distributed Skylining for Web Information Systems
E. Bertino et al. (Eds.): EDBT 2004, LNCS 2992, pp. 256-273, 2004.
© Springer-Verlag Berlin Heidelberg 2004

As a running example throughout this paper we will focus on a typical Web 
information service scenario. Our algorithm will support a sample user interacting 
with a route planning service such as Map-Quest’s Road Trip Planner or Driving
Directions [16]. This is a characteristic example of a Web service where the
gathering of on-line information is paramount: though a routing service is
generally capable of finding possible routes, the quality of certain routes (and
thus their desirability to the user) may differ heavily depending on current
information like road blocks, traffic jams or the weather conditions. Thus we
first have to collect a set of user-specified preferences and integrate them
with our query to the routing system. But of course users won’t be able to
specify something like ‘for my purposes the shortest route is 0.63 times more
important than that there is no jam’ in a sensible, i.e. intuitive, way.
Queries rather tend to be formulated like ‘I would prefer my route to be rather
short and with few jams’, giving no explicit weightings for a compensation
function. Hence the skyline over the set of possible routes is needed for a
high-quality answer set.
Since there often are many sources on the Internet offering current traffic information, 
usually the query also will have to be posed to a variety of sources needing an 
efficient algorithm for the distributed skyline computation. The example of a route 
planning service also stresses the focus on real time constraints, because most on-line 
information like traffic jams or accidents will have to be integrated on the fly and 
delivered immediately to be of use for navigation. Since such efficient algorithms for 
distributed retrieval are still problematic, today’s web portals like Map-Quest allow 
only a minimum of additional information (e.g. avoiding toll roads) and use central 
databases, that provide necessary information. However, given the dynamic nature of 
the Web this does not really meet the challenges of Web information systems. 



2 Web Information Systems Architecture and Related Work 

Modern Web information systems feature an architecture like the one roughly
sketched in figure 1. Using a (mobile) client device the user poses a query. Running
on an application/Web server, this query may be enriched with information about the
user (e.g. taken from stored profiles) and is then posed to a set of Internet sources.
Depending on the nature of the query, different sources can be involved in different
parts of the query, e.g. individual sources for traffic jams or weather information.
Collecting the individual results, the combining engine runs an algorithm to compute
the overall best matching objects. These final results then have to be aggregated
according to each individual user’s specifications and preferences. After a
transformation to the appropriate client format (e.g. using XSLT with suitable
stylesheets) the best answers are returned to the user.

The first area to address such a distributed retrieval problem was ‘top k
retrieval’ over middleware environments, e.g. [7], [9], [19]. These techniques have
proven particularly helpful for content-based retrieval of multimedia data.
Basically all algorithms distinguish between different query parts
(subqueries) evaluating different characteristics, which often have to be retrieved
from various subsystems or web sources. Each subsystem assigns a numerical score
value (usually normalized to [0,1]) to each object in the collection. The middleware
algorithms use two basic kinds of accesses: the iteration
over the best results from one source (or with respect to a single aspect of the query),
called a ‘sorted access’, and the so-called ‘random access’, which retrieves the
score value with respect to one source or aspect for a given object.

The physical implementation of these accesses strongly depends on the
application area and will usually differ from system to system. The gain from speeding
up a single access (e.g. using a suitable index) will of course complement the total
run-time improvement achieved by reducing the overall number of accesses. Therefore
minimizing the number of necessary object accesses, and thus also the overall query
runtime, is paramount for building practical systems (with real-time constraints) [1].
Prototypical Web information systems of that kind are given e.g. by [3], [5] or [2].
However, all these top k retrieval systems rely on a single combining function
(often called ‘utility function’) that is used to compensate scores between different
parts of the query: being worse in one aspect can be compensated by the object doing
better in another. The semantic meaning of these (user-provided)
combining functions is unclear, however, and users often have to guess the ‘right’
weightings for their query. Operations research and research in the field of human
preferences like [6] or [8] have long criticized this lack of expressiveness.

A more expressive model of non-discriminating combination has been introduced
into the database community by [15]. The ‘skyline’ or ‘Pareto set’ is the set of
non-dominated answers in the result for a query under the notion of Pareto optimality. The
underlying idea of Pareto optimality is that without knowing the actual database
content, there can be no precise a-priori knowledge about the most sensible
optimization in each individual case (and thus nothing that would allow a user to
choose weightings for a compensation function). The Pareto set or skyline hence
contains all best matching objects for all possible strictly monotonic optimization
functions. An example of skyline objects with respect to two query parts and their
scorings $S_1$ and $S_2$ is shown in figure 2. Each database object is seen as a point in
multidimensional space characterized by its score values. For instance objects
$o_x = (0.9, 0.5)$ and $o_y = (0.4, 0.9)$ each dominate all objects within a rectangular area
(shaded). But $o_x$ and $o_y$ are not comparable, since $o_x$ is better than $o_y$ in $S_1$ and
$o_y$ is better than $o_x$ in $S_2$. Thus both are part of the skyline.
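
The dominance relation used in this example can be written down in a few lines. The following is only a minimal sketch, assuming that higher scores are better and that objects are plain tuples of scores; the function name `dominates` and the extra object `o_z` are illustrative choices, not part of the original text.

```python
from typing import Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if score vector a dominates score vector b.

    a dominates b if a is at least as good in every dimension and
    strictly better in at least one (higher scores are better).
    """
    if len(a) != len(b):
        raise ValueError("score vectors must have the same length")
    at_least_as_good = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return at_least_as_good and strictly_better


# The two objects from the example in the text.
o_x = (0.9, 0.5)
o_y = (0.4, 0.9)

# Neither dominates the other, so both belong to the skyline.
incomparable = not dominates(o_x, o_y) and not dominates(o_y, o_x)

# A hypothetical object inside the rectangle spanned by o_x is dominated by it.
o_z = (0.7, 0.3)
print(incomparable, dominates(o_x, o_z))
```

An object never dominates itself under this definition, because the strict inequality in at least one dimension is required.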

Whereas [15] and the more recent extensive system in [12], with its algebra for
integrating the concept of Pareto optimality with the top k retrieval model for
preference engineering and query optimization in databases [11], are more powerful
in that they do not restrict skyline queries to numerical domains, they both rely on the
naive algorithm of quadratic complexity doing pairwise comparisons of all database
objects. Focusing on numerical domains, [4] was able to achieve logarithmic complexity
along the lines of [14]. Initially skyline queries were mainly intended to be performed
within a single database query engine. Thus the first algorithms and subsequent
improvements all work on a central (multidimensional) index structure like R*-trees
[20], certain partitioning schemes [21] or k-nearest-neighbor searches [13]. However,
such central indexes cannot be applied to distributed Web information systems. Since
there is still no algorithm to process distributed skyline queries, the
extended expressiveness of the query model could so far not be integrated into Web
information services. We will deal with the problem of designing an efficient
distributed algorithm for computing skyline queries relying only on sorted and
random accesses.



3 A Distributed Skylining Algorithm 

In this section we will investigate distributed skylining and present a first basic
algorithm. As motivated in the previous section, the basic skyline consists of
all non-dominated database objects, i.e. all database objects for which there is
no object in the database that is better or equal in all dimensions and strictly
better in at least one aspect. Assuming every database object to be represented by a
point in n-dimensional space with the coordinates for each dimension given by its
scores for the respective aspect, we can formulate the problem as:

The Skyline Problem: Given a set $O := \{o_1, \ldots, o_N\}$ of N database objects, n score
functions $s_1, \ldots, s_n$ with $s_i : O \rightarrow [0,1]$ and n sorted lists $S_1, \ldots, S_n$ containing
all database objects and their respective score values using one of the score functions
for each list; all lists are sorted descending by score values starting with the highest
scores. Wanted is the subset P of all non-dominated objects in O, i.e.

$$P := \{\, o_i \in O \mid \neg\exists\, o_j \in O : ( s_1(o_i) \le s_1(o_j) \wedge \ldots \wedge s_n(o_i) \le s_n(o_j) \wedge \exists\, q \in \{1, \ldots, n\} : s_q(o_i) < s_q(o_j) ) \,\}$$
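
Read literally, this definition already yields the naive quadratic algorithm mentioned in the related work: compare every object with every other object. The sketch below illustrates that baseline only; it is not the distributed algorithm of this paper. The names `naive_skyline` and `routes` are illustrative, and the score pairs are the (partly "say") values of six routes from the running example that follows.

```python
from typing import Dict, Tuple

Scores = Tuple[float, ...]


def dominates(a: Scores, b: Scores) -> bool:
    """a dominates b: no worse in any dimension, strictly better in one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def naive_skyline(objects: Dict[str, Scores]) -> Dict[str, Scores]:
    """Keep every object that no other object dominates (quadratic time)."""
    return {
        oid: scores
        for oid, scores in objects.items()
        if not any(
            dominates(other_scores, scores)
            for other_oid, other_scores in objects.items()
            if other_oid != oid
        )
    }


# (length score, traffic density score) per route, as used in the example.
routes = {
    "R1": (0.9, 0.5),
    "R2": (0.6, 0.9),
    "R3": (0.9, 0.8),
    "R4": (0.8, 0.8),
    "R5": (0.8, 0.6),
    "R6": (0.2, 0.8),
}
skyline = naive_skyline(routes)
print(sorted(skyline))  # ['R2', 'R3']
```

This baseline needs all scores of all objects in one place, which is exactly what a distributed setting with expensive accesses tries to avoid.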

We will now approach a suitable distributed algorithm to efficiently find this set.
Our algorithm basically consists of three phases: The first phase (step 1) will perform
sorted accesses until we have definitely seen all objects that can possibly be part of
the skyline. The second phase (steps 2 and 3) will extend the accesses to all objects
with minimum seen scores in the lists and will prune all other database objects. The
third phase (step 4) will employ focused random accesses to discard all seen objects
that are dominated before returning the skyline to the user. To keep track of all
accessed objects we will need a central data structure that contains all available
information about all objects seen and also groups the objects with respect to the
sorted lists that they have occurred in. The beauty of this design is that we only have
to check for domination within the small sets for each list and can return some first
results early.

Basic Distributed Skyline Algorithm

0. Initialize a data structure $P := \emptyset$ containing records with an identifier and n real
   values indexed by the identifiers, initialize n lists $K_1, \ldots, K_n := \emptyset$ containing records
   with an identifier and a real value, and initialize n real values $p_1, \ldots, p_n := 1$

1. Initialize counter i := 1.
   1.1. Get the next object $o_{new}$ by sorted access on list $S_i$
   1.2. If $o_{new} \in P$, update its record’s i-th real value with $s_i(o_{new})$, else create such a
        record in P
   1.3. Append $o_{new}$ with $s_i(o_{new})$ to list $K_i$
   1.4. Set $p_i := s_i(o_{new})$ and i := (i mod n) + 1
   1.5. If all scores $s_j(o_{new})$ ($1 \le j \le n$) are known, proceed with step 2 else with step 1.1.

2. For i = 1 to n do
   2.1. While $p_i = s_i(o_{new})$ do sorted access on list $S_i$ and handle the retrieved objects
        like in step 1.2 to 1.3

3. If more than one object is entirely known, compare pairwise and remove the
   dominated objects from P.

4. For i = 1 to n do
   4.1. Do all necessary random accesses for the objects in $K_i$ that are also in P,
        immediately discard objects that are not in P
   4.2. Take the objects of $K_i$ and compare them pairwise to the objects in $K_i$. If an
        object is dominated by another object remove it from $K_i$ and P

5. Output P as the set of all non-dominated objects
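
The steps above can be sketched in Python as follows. This is a minimal, single-process illustration under several assumptions, not the authors' implementation: each source is modelled as an in-memory list of `(id, score)` pairs sorted by descending score, random access is a dictionary lookup, every list contains every object, and the scores assumed for R7 in $S_2$ (0.4) and R8 in $S_1$ (0.3) are invented to complete the data of the running example.

```python
from typing import Dict, List, Optional, Tuple

Entry = Tuple[str, float]  # (object id, score)


def dominates(a, b) -> bool:
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def basic_distributed_skyline(lists: List[List[Entry]]) -> Dict[str, Tuple[float, ...]]:
    n = len(lists)
    lookup = [dict(lst) for lst in lists]      # stand-in for random access
    pos = [0] * n                              # next sorted-access position
    P: Dict[str, List[Optional[float]]] = {}   # step 0: candidate records
    K: List[List[str]] = [[] for _ in range(n)]
    p = [1.0] * n                              # last score seen per list

    def sorted_access(i: int) -> str:
        oid, score = lists[i][pos[i]]
        pos[i] += 1
        P.setdefault(oid, [None] * n)[i] = score   # step 1.2
        K[i].append(oid)                           # step 1.3
        p[i] = score                               # step 1.4
        return oid

    def drop_dominated(candidates: List[str]) -> None:
        dominated = {b for a in candidates for b in candidates
                     if a != b and dominates(P[a], P[b])}
        for oid in dominated:
            del P[oid]

    # Step 1: round robin until some object is known in every list.
    i = 0
    while True:
        o_new = sorted_access(i)
        i = (i + 1) % n
        if None not in P[o_new]:
            break

    # Step 2: also fetch all objects that tie with o_new's score in a list.
    # Peeking at the next entry models the extra access whose result is discarded.
    for i in range(n):
        while pos[i] < len(lists[i]) and lists[i][pos[i]][1] == P[o_new][i]:
            sorted_access(i)

    # Step 3: prune among the objects that are already completely known.
    drop_dominated([oid for oid, s in P.items() if None not in s])

    # Step 4: per list, complete the surviving records and prune within K_i.
    for i in range(n):
        K[i] = [oid for oid in K[i] if oid in P]
        for oid in K[i]:
            P[oid] = [lookup[j][oid] if s is None else s
                      for j, s in enumerate(P[oid])]
        drop_dominated(K[i])

    return {oid: tuple(s) for oid, s in P.items()}   # step 5


s1 = [("R1", 0.9), ("R3", 0.9), ("R5", 0.8), ("R4", 0.8),
      ("R7", 0.7), ("R2", 0.6), ("R8", 0.3), ("R6", 0.2)]
s2 = [("R2", 0.9), ("R4", 0.8), ("R6", 0.8), ("R3", 0.8),
      ("R8", 0.7), ("R5", 0.6), ("R1", 0.5), ("R7", 0.4)]
skyline = basic_distributed_skyline([s1, s2])
print(skyline)
```

The sketch performs every random access in step 4; the saving by upper-bound estimation discussed in the walkthrough below is deliberately left out to keep the control flow close to the numbered steps.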

For ease of understanding we show how the algorithm works for our running
example: for mobile route planning we have shown in [2], for the case of top k
retrieval, how traffic information aspects can be queried from various on-line sources.
Posing a query on the best route with respect to, say, its length ($S_1$) and the traffic
density ($S_2$), our user employs functions that evaluate the different aspects, but is not
sure how to compensate length and density. The following tables show two result lists
with some routes $R_i$ ordered by decreasing scores with respect to their length and
current traffic density:







$S_1$ (length)            R1    R3    R5    R4    R7
Score                     0.9   0.9   0.8   0.8   0.7

$S_2$ (traffic density)   R2    R4    R6    R3    R8
Score                     0.9   0.8   0.8   0.8   0.7





The algorithm in step 1 will in turn perform sorted accesses on both lists until the 
first route R4 has been seen in both lists leading to the following potential skyline 
objects: 



Route          R1    R2    R3    R4    R5    R6
Score $S_1$    0.9   ?     0.9   0.8   0.8   ?
Score $S_2$    ?     0.9   ?     0.8   ?     0.8


In step 2 we will do some additional sorted accesses on all routes that possibly
could also show the current minimum score in each list and find that R7 in $S_1$ already
has a smaller score, hence we can discard it. In contrast R3 in $S_2$ has the current
minimum score, hence we have to add it to our list, but can then discard the next
object R8 in $S_2$, which does have a lower score. Step 3 now tests if one of the two
completely seen routes R3 and R4 is dominated: by comparing their scores we find
that R4 is dominated by R3 and can thus be discarded. We can now regroup objects
into sets $K_i$ and do all necessary random accesses and the final tests for domination
only within each set.



$K_1$          R1    R3    R5
Score $S_1$    0.9   0.9   0.8
Score $S_2$    ?     0.8   ?

$K_2$          R2    R6    R3
Score $S_1$    ?     ?     0.9
Score $S_2$    0.9   0.8   0.8


Step 4 now works on the single sets $K_i$. We have to make a random access on R1
with respect to $S_2$ and find that its score is, say, 0.5. Thus we get its score pair (0.9, 0.5)
and have to check for domination within set $K_1$. Since it is obviously dominated by
the score pair (0.9, 0.8) of R3, we can safely discard R1. Doing the same for object
R5 we may retrieve a value of, say, 0.6; thus R5’s pair (0.8, 0.6) is also dominated by
R3 and we are finished with set $K_1$. Please note that we could have saved this last
random access on R5, since we already know that all unknown scores in $S_2$ must be
smaller than the current minimum of the respective list (in this case 0.8). This would
already have shown R5’s highest possible score pair (0.8, 0.8) to be dominated by R3.
At this point we are already able to output the non-dominated objects of $K_1$, since
Lemma 2 shows that if any of the objects of set $K_1$ should be dominated by objects in
another set $K_j$, they would also always be dominated by an object in $K_1$.
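
The saving described for R5 amounts to testing a best-case score vector before paying for a random access. The sketch below illustrates this under the assumption that each unknown score is bounded from above (inclusively, which is slightly conservative) by the current minimum $p_i$ of its list; the helper names `best_case` and `provably_dominated` are illustrative and do not appear in the paper.

```python
from typing import List, Optional, Sequence, Tuple


def dominates(a, b) -> bool:
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def best_case(known: Sequence[Optional[float]], p: Sequence[float]) -> Tuple[float, ...]:
    """Fill every unknown score (None) with the current minimum p_i of list i."""
    return tuple(p_i if s is None else s for s, p_i in zip(known, p))


def provably_dominated(known: Sequence[Optional[float]], p: Sequence[float],
                       complete_objects: List[Tuple[float, ...]]) -> bool:
    """True if even the best-case vector is dominated by a fully known object."""
    bound = best_case(known, p)
    return any(dominates(c, bound) for c in complete_objects)


p = (0.8, 0.8)            # minimum scores seen in S_1 and S_2 after step 2
r3 = (0.9, 0.8)           # R3 is completely known
r5_known = (0.8, None)    # R5 has only been seen in S_1
r6_known = (None, 0.8)    # R6 has only been seen in S_2
r2_known = (None, 0.9)    # R2 has only been seen in S_2

skip_r5 = provably_dominated(r5_known, p, [r3])   # best case (0.8, 0.8)
skip_r6 = provably_dominated(r6_known, p, [r3])   # best case (0.8, 0.8)
skip_r2 = provably_dominated(r2_known, p, [r3])   # best case (0.8, 0.9)
print(skip_r5, skip_r6, skip_r2)  # True True False
```

Because the bound is a best-case estimate, an object discarded this way would also be discarded with its actual scores, which is the argument used in part a) of the correctness proof.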

Dealing with $K_2$ we have to make random accesses for $S_1$ on routes R2 and R6 and
find for route R2 a score value of, say, 0.6, leading to a score pair of (0.6, 0.9). But it
cannot be dominated by R3’s pair (0.9, 0.8), as its score in $S_2$ is higher than R3’s.
Finally for R6 we may find a score of, say, 0.2; thus it is dominated by R3 and can be
discarded (also in these cases we could have saved two random accesses as shown
above). We can now deliver the entire skyline for our user’s query, consisting of routes
R3 and R2. All other routes in the database (either seen or not yet accessed) are
definitely dominated by at least one of these two routes.

Independently of any weightings a user could have chosen for a compensation
function, either route R3 or R2 would thus have turned up as top object dominating all
other routes. Delivering them both as top objects saves users from having to state
unintuitive a-priori weightings and allows for an informed choice according to each
individual user’s preferences. But we still have to make sure that upon termination no
pruned object can belong to the skyline and no dominated object will ever be
returned. We will state two lemmas and then prove the correctness of our algorithm.

Lemma 1 (Discarding unseen objects)

After an object $o_x$ has been seen in each list and all objects down to at least score
$s_i(o_x)$ ($1 \le i \le n$) in each list have also been seen, the objects not yet seen cannot be
part of the skyline; more precisely, they are dominated by $o_x$.

Proof: Let $o_x$ be the object seen in all lists. Then with $p_i$ as the minimum score seen in
each list we have, due to the sorting of the lists, $\forall i\ (1 \le i \le n) : s_i(o_x) \ge p_i$. Since all
objects having a score of at least $p_i$ in list i have been collected, we can conclude
that any not yet seen object $o_{unseen}$ satisfies $\forall i\ (1 \le i \le n) : s_i(o_{unseen}) < p_i \le s_i(o_x)$
and thus $\forall i\ (1 \le i \le n) : s_i(o_{unseen}) < s_i(o_x)$. Hence $o_{unseen}$ is dominated by $o_x$ and
thus cannot be part of the skyline, independently of $o_x$ itself being part of the set or
being dominated. ■

We can even show the somewhat stronger result that, if we have seen an object $o_x$
in all lists and stop the sorted accesses in step 2 after seeing only a single worse object
in any of the lists, we can still safely discard all unseen objects. This is because we
need the strict ‘<’ in one single list only. In the other lists ‘≤’ would still be sufficient
(since due to sorting ‘=’ is the highest possible).

Lemma 2 (Objects can only be dominated by objects in the same set $K_i$)

Assume that all objects that have been seen are divided into n sets according to the
lists in which they occurred, i.e. if an object $o_x$ occurs in list i ($1 \le i \le n$) it is added to
set $K_i$. Let us further assume that in the lists in which $o_x$ occurred, all objects having at
least the respective score value of $o_x$ have also been seen. Then, if the object $o_x$ in any
set $K_i$ is dominated by any other object, this object also has to be part of set $K_i$.

Proof: Let $o_x$ be any dominated object already seen and assigned to at least one set $K_i$
($1 \le i \le n$). Due to Lemma 1 $o_x$ cannot be dominated by any unseen object, thus the
dominating object $o_y$ has already been seen and thus has also been assigned to at least
one of the sets $K_i$ ($1 \le i \le n$). If $o_x$ and $o_y$ are in exactly the same sets there is nothing
to show. Thus let us assume $o_y$ dominates $o_x$ and there exists at least one set $K_j$
($1 \le j \le n$) containing object $o_x$, but not object $o_y$. Thus due to the sorting of the lists
and the fact that we have seen all objects in list j having at least the respective score
value $s_j(o_x)$ of object $o_x$, we have to conclude that $s_j(o_y) < s_j(o_x)$, in contrast to the
assumption that $o_y$ dominates $o_x$. ■

Theorem 1: (Correctness of the Basic Algorithm)

The basic algorithm always terminates and delivers the entire set of non-dominated
objects and only the set of non-dominated objects.

Proof: Since the termination is obvious, we have to show that a) no relevant object is
missed by our algorithm and that b) no object in the returned set can be dominated by
any other object.

Ad a) Steps 1 and 2 of the algorithm collect all objects until one object has been
seen in all lists and all objects of the minimum score in each list have also been seen.
Thus Lemma 1 applies and we can safely discard all unseen objects, since they cannot
be part of the skyline. In steps 3 and 4 only dominated objects are discarded (in step 4
we also might use upper boundary estimations of some scores for discarding, but
since the upper boundaries are best case estimations, it is obvious that objects
discarded in step 4 would also be discarded using their actual score values). Thus the
set returned in step 5 will contain all objects of the skyline.

Ad b) After steps 1 to 3 Lemma 2 applies and we can restrict the search for
dominating objects to the sets $K_i$ ($1 \le i \le n$). Since in any set $K_i$ no object can be
dominated by an object having a strictly smaller score with respect to the i-th list, it is
sufficient to do pairwise comparisons with those objects having a larger or equal
score. Thus step 4 correctly discards all dominated objects within the sets $K_i$ and
since all objects returned in step 5 must have been part of at least one $K_i$, they cannot
be dominated by any object. ■

Since the algorithm is supposed to work with distributed web sources, which have
rather high access costs, for the optimality and complexity considerations we have to
focus on the necessary object accesses instead of main memory operations. The next
theorem will show that the termination condition of phase 1 is optimal, since one of
the objects seen in all lists is definitely part of the skyline. Stopping earlier would thus
discard possibly non-dominated objects; we therefore have to see an object in all lists.

Theorem 2 (Optimality of Sorted Accesses):

The basic algorithm uses an optimal number of sorted accesses.

Proof: Sorted accesses are only made during the first two steps. After the first step
one object has been seen in all lists. In step 2 we do further sorted accesses to get all
objects with at least one score equal to the respective minimum score in each list. If
we can show that among these objects and the object seen in all lists there always is at
least one object belonging to the skyline, we could not stop doing sorted accesses
earlier and thus use an optimal number of sorted accesses.

Let $o_x$ be the first object that has occurred in all lists and $p_i$ ($1 \le i \le n$) be the
minimum scores in each list. Then we get $\forall i : s_i(o_x) \ge p_i$ and $s_k(o_x) = p_k$ for at least
one $1 \le k \le n$. For every object $o \ne o_x$ seen during step 1 of our algorithm there is at
least one list $S_j$ ($1 \le j \le n$) in which o has not been seen and hence we have either
$s_j(o) < p_j \le s_j(o_x)$ (A) or $s_j(o) = p_j$ (B).

If case (A) applies for an object o, it cannot dominate the object $o_x$. Thus object $o_x$
can only be dominated by an object for which case (B) applies, i.e. one of those
objects that have occurred during step 2 of our algorithm. Choose among those
objects the maximum object $o_m$ dominating $o_x$. We will then show by contradiction
that $o_m$ belongs to the skyline:

If $o_m$ were not part of the skyline, it would have to be dominated by another
object. Due to Lemma 1 $o_m$ cannot be dominated by any unseen object, and due to
being maximal among those objects occurring in step 2, it would have to be
dominated by an object o seen in step 1 and not seen in step 2 of the algorithm. This
means that there is an index j such that $s_j(o)$ is smaller than $p_j$ (cf. case (A) above).
Therefore o can dominate neither $o_x$ nor $o_m$, leading to the contradiction. ■

Fig. 3. Savings implemented by heuristic 1 (lists $S_1, S_2, \ldots, S_n$ by rank, expanded round robin)



4 Improvements by Advanced Heuristics 

Having shown that we will have to see at least one object in all the lists, we will now
focus on heuristics to find this object, so that the first phase terminates more
quickly, and will try to minimize the necessary comparisons within the sets $K_i$.

Consider the situation shown in figure 3. Having adopted a round robin strategy in
our basic algorithm, we have to expand all the lists until an object (e.g. $o_x$) occurs in
all lists. But our proof of correctness allows us to immediately disregard even all
those objects that have only occurred in any list after $o_x$ (i.e. the shaded areas). Thus
using all information about discovered objects at an early stage and employing a
sophisticated control flow, we can improve our algorithm by immediately focusing on
objects that can be assumed to foster early termination by being the first object to
occur in all lists with reasonably high probability. Having chosen such an object we
will no longer do sorted accesses on lists in which this object has already occurred,
but rather expand lists in which its score is still unknown. Therefore we need to know
how to find the object that is most likely to terminate our algorithm. In case studies
on multi-objective optimization like [3] one of the most effective functions for
estimating dominating objects is a greedy strategy called the ‘maximin’ function. Our
heuristic for estimating an appropriate object has been built along the lines of this
function. But whereas the ‘maximin’ function only focuses on the maximum value
after evaluating the minimum scores for each object and thus advocates the smallest
possible expansion in every list, our heuristic will additionally take advantage of the
fact that, because of the sorting of the lists and the recent sorted accesses on them, we
know exactly the current score value in each list and thus can better estimate the
necessary expansion.

Heuristic 1: If all scores of an object are known (either by sorted or by random
access), consider all score values in lists where it has not yet been seen by sorted
access. If we sum up the differences between these values and the last score value seen
by sorted access on the respective list, we get an aggregated value for each object.
The object with the minimum value can be considered the most promising object: it
needs the least expansion in all lists and is therefore likely to be the object that will
first occur in all of the lists. If there are several objects with the same minimum value,
the one with the minimum sum of scores will need the least expansion of all lists.

To find these objects, we will mix sorted and random accesses already in the first
phase to immediately get all information about an object. Since these accesses would
have been necessary in the third phase of the basic algorithm anyway, by doing them
immediately we just spend a few random accesses too many for those objects that we
might already have seen by sorted access in more than one list. Knowing all scores we
can now estimate how far we would have to expand all the lists if we had to see the
latest object in all lists, adding up the differences between values seen by random
access and the current score in the respective list. Focusing only on the best object
with respect to necessary list expansions, we will employ indicators that tell us which
lists to expand next, avoiding those lists our object has already occurred in.
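
The aggregated value of heuristic 1 is a simple sum of differences. The following sketch computes it for three routes of the example used later in this section; it assumes 0-based list indices, takes the initial value 1.0 for a list that has not been accessed yet, and leaves out the tie-break. The name `expansion_estimate` is illustrative.

```python
from typing import Sequence, Set


def expansion_estimate(scores: Sequence[float], seen_in: Set[int],
                       p: Sequence[float]) -> float:
    """Sum of p_i - s_i(o) over the lists i where o was not yet seen by sorted access."""
    return sum(p[i] - scores[i] for i in range(len(scores)) if i not in seen_in)


# Index 0 is S_1 (length), index 1 is S_2 (traffic density).
# R1 = (0.9, 0.5), seen in S_1 only; S_2 has not been accessed yet, so p_2 = 1.0.
r1 = expansion_estimate((0.9, 0.5), seen_in={0}, p=(0.9, 1.0))
# R2 = (0.6, 0.9), seen in S_2 only; both lists currently stand at 0.9.
r2 = expansion_estimate((0.6, 0.9), seen_in={1}, p=(0.9, 0.9))
# R3 = (0.9, 0.8), seen in S_1 only.
r3 = expansion_estimate((0.9, 0.8), seen_in={0}, p=(0.9, 0.9))

print(round(r1, 2), round(r2, 2), round(r3, 2))  # 0.5 0.3 0.1

# The object with the smallest estimate is the most promising one.
estimates = {"R1": r1, "R2": r2, "R3": r3}
most_promising = min(estimates, key=estimates.get)
print(most_promising)  # R3
```

An object that has been seen in every list gets the estimate 0, which corresponds to the termination condition of the first phase.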

Since we gather all information about an object at its first occurrence and
immediately assess its probable utility for termination, we will not expand any list
more than necessary. If we can choose between several lists for the next sorted access,
we can either pick one randomly or, if we expect non-uniform data distributions, employ
a complementing indicator technique, e.g. using the derivatives of the score distribution
function in each list along the lines of [9], to estimate the expected
gain in each list. Please note that our heuristic 1 will not affect the abstract order of
complexity from our previously stated optimality results, because the maximum
improvement factor over the round robin strategy can only be the number of lists (n).
But given the rather expensive costs of object accesses over the Internet, even small
numbers of accesses saved will improve the overall run-time behavior, as shown in
[5] or [1]. Thus, improvements taking only constant factors off the algorithm’s
complexity should also be employed towards meeting real-time constraints.

Our second heuristic will focus on the necessary comparisons within the sets $K_i$.
Obviously no object having a smaller score with respect to $S_i$ will be able to dominate
any object having a larger score. Thus we do not really need the pairwise comparisons
suggested in the basic algorithm. We only have to compare pairwise between
objects within the same set $K_i$ having equal scores and can otherwise test if the
objects with smaller scores are dominated by ones having larger score values.

Heuristic 2: Start with the objects first seen in each set $K_i$ and compare pairwise
all objects with the same score value. Then only test for domination by objects with
higher scores.

To implement this we employ the fact that since the lists are ordered, all $K_i$ are
ordered as well. We will use two counters q and b and divide each $K_i$ into subsets
grouping equal score values. Starting with the first set we will assume the objects to be
enumerated and set q to the number of the first object of a subset and b to the number
of the last. According to heuristic 2 we don’t need any comparisons with objects at
numbers larger than b, we need pairwise comparisons for all objects between q and b,
and we need a test for domination by all objects with numbers smaller than q.
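
The grouping idea of heuristic 2 can be sketched with `itertools.groupby` instead of the explicit counters q and b; this is a simplification chosen for readability, not the authors' formulation. The sketch assumes that `K_i` holds `(id, score in list i)` pairs in sorted-access order and that `P` maps ids to complete score tuples; the name `filter_set` is illustrative.

```python
from itertools import groupby
from typing import Dict, List, Tuple


def dominates(a, b) -> bool:
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def filter_set(K_i: List[Tuple[str, float]],
               P: Dict[str, Tuple[float, ...]]) -> List[str]:
    """Return the ids in K_i that are not dominated by another object of K_i."""
    survivors: List[str] = []
    # K_i is ordered by descending score, so equal scores are adjacent.
    for _, group in groupby(K_i, key=lambda entry: entry[1]):
        block = [oid for oid, _ in group if oid in P]
        # Pairwise comparison only inside the block of equal scores.
        block = [o for o in block
                 if not any(dominates(P[x], P[o]) for x in block if x != o)]
        # One-directional test against survivors with higher scores in list i.
        block = [o for o in block
                 if not any(dominates(P[h], P[o]) for h in survivors)]
        survivors.extend(block)
    return survivors


# Set K_2 of the improved example: ids with their S_2 scores.
K_2 = [("R2", 0.9), ("R6", 0.8), ("R4", 0.8), ("R3", 0.8)]
P = {"R2": (0.6, 0.9), "R6": (0.2, 0.8), "R4": (0.8, 0.8), "R3": (0.9, 0.8)}
kept = filter_set(K_2, P)
print(kept)  # ['R2', 'R3']
```

Comparing later blocks only against survivors is sufficient because dominance is transitive: anything dominated by a discarded object is also dominated by the object that caused the discard.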

Using our heuristics we will now present our improved algorithm for distributed
skyline queries. Again we need the initialization of a central data structure for the set
of possible skyline objects and sets containing the objects for each sorted list as
before. Additionally we need a variable for the object that is considered most
promising to terminate our algorithm. Note that all necessary random accesses are
now already performed in step 1 in order to derive a greedy estimation of the object
most likely to foster early termination.

Improved Distributed Skyline Algorithm

0. Initialize a data structure $P := \emptyset$ containing records with an identifier and n real
   values for scores indexed by the identifiers, initialize n lists $K_1, \ldots, K_n := \emptyset$
   containing records with an identifier and one real value, initialize a record
   term_oid containing an identifier and a real value $:= \emptyset$ and initialize n real values
   $p_1, \ldots, p_n := 1$

1. Initialize counter i := 1.
   1.1. Get the next object $o_{new}$ by sorted access on list $S_i$, set $p_i := s_i(o_{new})$ and update
        the real value in term_oid according to step 1.3
   1.2. If $o_{new} \notin P$
        1.2.1. Create a record in P containing oid and score in $S_i$ in the i-th entry in its
               record.
        1.2.2. Do random accesses on all missing scores and update the record in P like
               above
   1.3. Add up the differences between $o_{new}$’s score values in lists where it has not yet
        been seen and the $p_i$ in these lists
   1.4. If this sum is smaller than the value in term_oid, replace the oid and the value
        in term_oid with the oid and new value of $o_{new}$
   1.5. If the sum is equal to the value in term_oid, replace like in 1.4 only if the total
        sum of scores for $o_{new}$ is larger than the sum for the object given by term_oid
   1.6. Append $o_{new}$ with $s_i(o_{new})$ to list $K_i$
   1.7. Set i to any number of a set $K_i$ in which the object given by term_oid has not
        yet occurred. If it is an element of all $K_i$ proceed with step 2 else with step 1.1.

2. Let $o_{term}$ be the object given by term_oid. For i = 1 to n do
   2.1. While $p_i = s_i(o_{term})$ do sorted access on list $S_i$ and update $p_i$ like in step 1.4,
        append it to list $K_i$ like in step 1.3 and if the retrieved object is not in P handle it
        like in step 1.2.1 and 1.2.2

3. For i = 1 to n do
   3.1. q := 0, b := 0
   3.2. While there is a (q+1)-th entry in $K_i$ do
        3.2.1. Repeat (collect an object of $K_i$ and set b := b+1) until the value of the
               (b+1)-th entry in $K_i$ is strictly smaller than the b-th entry or there is no
               (b+1)-th entry
        3.2.2. If any collected object does not exist in P, discard the object and remove it
               from $K_i$ and set b := b-1
        3.2.3. Compare the collected objects pairwise. If any of these objects is dominated,
               discard it and remove it from $K_i$ and P and set b := b-1
        3.2.4. Compare all collected objects pairwise to all objects being on a position
               smaller or equal than q in $K_i$. If any collected object is dominated, discard it
               and remove it from $K_i$ and P and set b := b-1
        3.2.5. Set q := b

4. Output P as the set of all non-dominated objects, i.e. the skyline.
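
A compact Python sketch of the improved algorithm is given below. It follows steps 1 and 2 closely but, as a simplification, replaces the counter-based step 3 by plain pairwise pruning inside each $K_i$. The same assumptions as for any in-memory illustration apply: sources are sorted `(id, score)` lists containing every object, random access is a dictionary lookup, and the scores of R7 in $S_2$ (0.4) and R8 in $S_1$ (0.3) are invented to complete the example data. It additionally returns the set of accessed objects so the savings can be inspected.

```python
from typing import Dict, List, Set, Tuple

Entry = Tuple[str, float]  # (object id, score)


def dominates(a, b) -> bool:
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def improved_distributed_skyline(lists: List[List[Entry]]):
    n = len(lists)
    lookup = [dict(lst) for lst in lists]        # stand-in for random access
    pos = [0] * n                                # next sorted-access position
    p = [1.0] * n                                # last score seen per list
    P: Dict[str, Tuple[float, ...]] = {}         # complete score records
    K: List[List[str]] = [[] for _ in range(n)]
    seen: Dict[str, Set[int]] = {}               # lists where an object was seen

    def sorted_access(i: int) -> str:
        oid, score = lists[i][pos[i]]
        pos[i] += 1
        p[i] = score                              # step 1.1
        if oid not in P:                          # steps 1.2.1 and 1.2.2
            P[oid] = tuple(lookup[j][oid] for j in range(n))
            seen[oid] = set()
        seen[oid].add(i)
        K[i].append(oid)                          # step 1.6
        return oid

    def key(oid: str):
        # Step 1.3 estimate, with the step 1.5 tie-break on the larger score sum.
        estimate = sum(p[j] - P[oid][j] for j in range(n) if j not in seen[oid])
        return (estimate, -sum(P[oid]))

    # Step 1: always expand a list in which the most promising object is missing.
    term, i = None, 0
    while True:
        o_new = sorted_access(i)
        if term is None or key(o_new) < key(term):   # steps 1.4 and 1.5
            term = o_new
        missing = [j for j in range(n) if j not in seen[term]]
        if not missing:
            break
        i = missing[0]                               # step 1.7

    # Step 2: collect ties with the terminating object's score in each list.
    for i in range(n):
        while pos[i] < len(lists[i]) and lists[i][pos[i]][1] == P[term][i]:
            sorted_access(i)

    accessed = set(P)
    # Step 3, simplified: plain pairwise pruning inside each K_i.
    for i in range(n):
        alive = [oid for oid in K[i] if oid in P]
        dominated = {b for a in alive for b in alive
                     if a != b and dominates(P[a], P[b])}
        for oid in dominated:
            del P[oid]
    return P, accessed


s1 = [("R1", 0.9), ("R3", 0.9), ("R5", 0.8), ("R4", 0.8),
      ("R7", 0.7), ("R2", 0.6), ("R8", 0.3), ("R6", 0.2)]
s2 = [("R2", 0.9), ("R4", 0.8), ("R6", 0.8), ("R3", 0.8),
      ("R8", 0.7), ("R5", 0.6), ("R1", 0.5), ("R7", 0.4)]
skyline, accessed = improved_distributed_skyline([s1, s2])
print(sorted(skyline), sorted(accessed))
```

Re-evaluating `key(term)` on every iteration corresponds to updating the value stored in term_oid whenever a list minimum $p_i$ changes.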

Since the correctness of the improved algorithm is straightforward along the lines
of the basic algorithm and also the optimality holds, we will again return to our
example to show how the algorithm works. Again we pose a query on the best route
with respect to its length ($S_1$) and the traffic density ($S_2$). The following tables show
our result lists with routes ordered by scores:

$S_1$ (length)            R1    R3    R5    R4    R7
Score                     0.9   0.9   0.8   0.8   0.7

$S_2$ (traffic density)   R2    R4    R6    R3    R8
Score                     0.9   0.8   0.8   0.8   0.7





The algorithm in step 1 will perform a sorted access on $S_1$ and find route R1. A
random access will reveal R1’s second score 0.5 and that its sum of unseen values is
(1.0 - 0.5) = 0.5. That means our first estimation is that we will have to expand the list
$S_2$ down to score 0.5 in order to see R1 in all lists. Thus we have to do a sorted access on
list $S_2$, trying to decrease the scores to find R1’s second score, and we get route R2.
R2’s other score (0.6 in $S_1$) leads to a sum of differences of 0.3. Thus it is more promising
than R1 and we will focus on lists where R2 has not yet occurred. Accessing $S_1$ we
encounter object R3, whose second score 0.8 again leads to a change in our term_oid
to R3 with value 0.1. After we have also accessed R4 and R6 in list $S_2$, which both show
larger sums, we finally encounter R3 and can terminate step 1.



Route          R1      R2      R3      R4      R6
Score $S_1$    0.9     0.6     0.9     0.8     0.2
Score $S_2$    0.5     0.9     0.8     0.8     0.8
term_oid       R1      R2      R3      R3      R3
next access    $S_2$   $S_1$   $S_2$   $S_2$   $S_2$


In step 2 we will do some additional accesses on all routes also showing the current
minimum score in each list and find that R5 in $S_1$ already has a smaller score, hence
we can discard it, and we can also discard the next object R8 in $S_2$.



$K_1$          R1    R3
Score $S_1$    0.9   0.9

$K_2$          R2    R6    R4    R3
Score $S_2$    0.9   0.8   0.8   0.8


Step 3 now focuses on the sets $K_i$ and finds that in $K_1$ R3 dominates R1 and in $K_2$
we first have to compare R6, R4 and R3 pairwise and find that R3 dominates all and
then only have to test if R3 is dominated by R2. However, as R2 does not dominate
R3 we can return them as the skyline. Please note that besides more efficient
comparisons within the $K_i$, even in this limited example our indicator technique
already saved us expensive object accesses on routes R5 and R7, which now remain
unseen.



5 Evaluation of Distributed Skylining 

The presented algorithm is the first to address the problem of distributed
skylining in Web information systems, so our evaluation cannot compare it to
similar algorithms. Comparisons with algorithms over central indexes (which will of
course be faster, as they do not have to deal with network latencies) would not
yield sensible results either. We therefore concentrate on the necessary number of
object accesses, the total number of objects in the skyline for some practical
cases, and the improvements that can be gained over the basic algorithm by using our
advanced heuristics. For all experiments we used an independent data distribution
of scores.

Fig. 4. Improvement due to heuristics (average improvement factor)

Fig. 5. Saved accesses w.r.t. database size

Let us first glance at the savings due to our heuristics and then evaluate the
performance of our improved algorithm. We focus on the improvement factors in
terms of overall object accesses saved. Figure 4 shows the average improvement
factors for different numbers of lists (3, 5, and 10) and two database sizes of
10,000 and 100,000 objects. Independent of the database size, the average
improvement factors in our experiments range between 1.5 for small numbers of lists
and around 2.5 for higher numbers. Thus, even these simple heuristics, without any
tuning, roughly halve the necessary object accesses. We can expect even higher
factors by tuning the given heuristic to adapt more closely to the data
distribution, as shown e.g. in [9].

Now we can concentrate on the object accesses that our algorithm saves with respect
to the database size. Figure 5 shows what percentage of the database can be pruned,
again for different numbers of lists and different database sizes. Our algorithm
scales well with the database size and works well for lower numbers of lists, e.g.
it prunes more than 95% over 3 lists. However, the performance quickly deteriorates
with a growing number of lists. To explain this behavior we have to consider the
portion of skyline objects among all objects that have been accessed (cf. Figure 6).
We find that, though our algorithm’s performance seems to deteriorate with growing
numbers of lists, its precision, i.e. the share of accessed objects that actually
belong to the skyline, increases strongly with growing numbers of lists. For
instance, in the case of 10 lists over a database of 10,000 objects almost 60% of
the accesses are definitely necessary to see the entire skyline, i.e. to terminate
the algorithm correctly. Considering this instance further, we can conclude that if
we access about 90% of 10,000 objects and about 55% of them are necessary, the
skyline has to be about 49.5% of the entire database.

Fig. 6. Skyline objects among all objects accessed

Table 1. Size of the skyline with respect to different numbers of database objects and lists

Size of database (N)   Number of lists   Size of skyline (in % of database size)
10,000                 3                 0.51
                       5                 4.44
                       10                49.25
100,000                3                 0.07
                       5                 1.00
                       10                25.11

To support these considerations we performed further experiments on the actual
average size of the skyline for varying numbers of lists and different database
sizes. Table 1 shows that our considerations were correct (also confirmed by
experiments in [4]). Indeed the size of the skyline rapidly increases with larger
numbers of lists. We are forced to conclude that, though the concept of skylining
may be a very intuitive model for querying, its output seems to be manageable only
for rather small numbers of lists to combine. In fact skyline sizes grow
exponentially with the number of dimensions. Thus, independently of the retrieval
algorithm, the problem itself does not scale, and we still need an effective
dimensionality reduction for skyline queries that are likely to retrieve huge
results.
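
The growth of the skyline with the number of lists can be reproduced qualitatively with a small simulation. The sketch below draws independent, uniformly distributed scores and counts the skyline objects; the function names, the database size of 1,000 objects and the seed are illustrative assumptions, so the resulting numbers are not expected to match Table 1.

```python
import random


def dominates(a, b):
    """a dominates b: at least as good in every list, strictly better in one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def skyline_size(points):
    """Count the non-dominated points (higher scores are better)."""
    found = []
    # A dominating point never has a smaller score sum, so after sorting by
    # descending (sum, scores) each point only has to be checked against the
    # skyline points found so far.
    for p in sorted(points, key=lambda p: (sum(p), p), reverse=True):
        if not any(dominates(s, p) for s in found):
            found.append(p)
    return len(found)


def simulate(n, lists, seed=0):
    """Skyline size for n objects with independent uniform scores."""
    rng = random.Random(seed)
    points = [tuple(rng.random() for _ in range(lists)) for _ in range(n)]
    return skyline_size(points)


if __name__ == "__main__":
    for lists in (3, 5, 10):
        size = simulate(1000, lists)
        print(f"{lists:2d} lists: {size} skyline objects ({100 * size / 1000:.1f}%)")
```

Sorting before the dominance checks is only a convenience of this sketch; it plays no role in the distributed setting, where objects are discovered through sorted and random accesses.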



6 Sampling the Efficient Frontier for Improved Scalability 

So even if an algorithm could compute high-dimensional skylines in acceptable time,
it would still not be sensible to return something like 50% of the database objects
to the user for manual processing. If, on the other hand, users first aggregate all
lists in which a compensation between scores can be defined, and then use the skyline
query model only for a modest number of these aggregated lists, the skyline will
consist of a sensible number of elements and can be retrieved reasonably well. But
how do we know which dimensions can be compensated and over which dimensions we
still need a skyline? As pointed out in [4], specific characteristics of dimensions
such as correlation have an essential influence on the manageability of the
resulting skyline. Correlated data usually results in smaller skylines than the
independently distributed case. In contrast, anti-correlated distributions result in
a vast increase of the number of skyline objects. One measure to assess such
characteristics, which hint at the size of the result, is the objects’ average
consistency of performance, i.e. whether the scores of each object show similar
absolute values in all dimensions. The hope is to see in advance, e.g., whether
there are correlations between some dimensions, which in turn could be condensed
into a single dimension. Since computing skylines over small numbers of dimensions
(say 3) is still not at all problematic, our main idea is to get an impression of
the original characteristics of the skyline by investigating skylines of some
representative low-dimensional subsets of the original dimensions. The following
theorem states that, without having to calculate the high-dimensional skyline, our
sampling can nevertheless rely on actual skyline objects, which in turn improves
the sampling’s quality.

Theorem 3 (Skyline of Subsets of Dimensions):

For each object o in the skyline of a subset of the dimensions (i.e. a subset of score 
lists) there is always a corresponding object o’ in the skyline of all dimensions having 
exactly the same scores as o with respect to the subset of dimensions. 

Proof: Assume that we have chosen an arbitrary subset of score lists and calculated
the skyline P of this subset. Let o be any object of P. We have to show that
there is a corresponding object o’ in the skyline Q of all score lists having the
same scores as o in the chosen subset. If o is also part of Q, the statement is
trivially true. Thus let us assume that o is not an element of Q and therefore must
be dominated by at least one object p; since dominance is transitive and the number
of objects is finite, we can choose p from Q. That means s_i(p) ≥ s_i(o) holds for
all lists i. If, however, ‘strictly better’ held for some list i among the chosen
lists, i.e. s_i(p) > s_i(o), object o would already be dominated by p with respect
to our subset. Since this would contradict our assumption that o is part of the
skyline of the subset, s_i(p) = s_i(o) has to hold for the entire subset, and p is
our object o’. ■

Using this result we now propose the sampling scheme. We sample the skyline in
three steps: choosing q subsets of the lists, calculating their lower-dimensional
skylines, and merging the results into the sample. Since skylines can already grow
large for only 4 to 5 dimensions, we always sample with three-dimensional subsets.
In our experiments, values of q = 5 for 10 score lists and q = 15-20 for 15 score
lists provided sufficient sampling quality. For simplicity we just take the entire
low-dimensional skyline (step 2.1) and merge it (step 2.2). As Theorem 3 shows,
should two objects feature the same scores within a low-dimensional skyline, random
accesses on all missing dimensions could sometimes be used to rule out a few
dominated objects. We experimented with this (more exact) approach, but found it to
perform much worse, while improving the sampling quality only slightly.

Sampling Skylines by Reduced Dimensions 

1. Given m score lists, randomly select q three-dimensional subsets such that all
   lists occur in at least one of the subsets. Initialize the sampling set P := ∅

2. For each three-dimensional subset do

   2.1. Calculate the skyline P_i of the subset

   2.2. Union with the sampling set: P := P ∪ P_i

3. The set P is a sample of the skyline for all m score lists
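
The three steps above can be sketched in Python as follows. This is a minimal in-memory illustration, not the distributed implementation: the helper names, the round-robin way of guaranteeing that every list is covered, and the naive quadratic skyline computation are assumptions made for this example only.

```python
import random


def dominates(a, b, dims):
    """a dominates b when only the score lists in dims are considered."""
    return all(a[i] >= b[i] for i in dims) and any(a[i] > b[i] for i in dims)


def skyline(objects, dims):
    """Indices of the objects not dominated with respect to dims (naive check)."""
    return {
        i
        for i, o in enumerate(objects)
        if not any(dominates(p, o, dims) for p in objects)
    }


def choose_subsets(m, q, rng, size=3):
    """Step 1: pick q subsets of `size` lists so that every list occurs once."""
    if m < size or q * size < m:
        raise ValueError("need size <= m <= q * size to cover all lists")
    order = list(range(m))
    rng.shuffle(order)
    subsets = [set() for _ in range(q)]
    for pos, dim in enumerate(order):  # deal the shuffled lists round-robin
        subsets[pos % q].add(dim)
    for s in subsets:  # fill each subset up with further random lists
        while len(s) < size:
            s.add(rng.randrange(m))
    return [tuple(sorted(s)) for s in subsets]


def sample_skyline(objects, m, q, rng):
    """Steps 2 and 3: union of the low-dimensional skylines."""
    sample = set()
    for dims in choose_subsets(m, q, rng):
        sample |= skyline(objects, dims)  # P := P ∪ P_i
    return sample


if __name__ == "__main__":
    rng = random.Random(42)
    objects = [tuple(rng.random() for _ in range(6)) for _ in range(300)]
    sample = sample_skyline(objects, m=6, q=3, rng=rng)
    full = skyline(objects, range(6))
    print(len(sample), "sampled objects,", len(full), "objects in the full skyline")
```

Because random floating-point scores practically never tie, Theorem 3 implies that every sampled object here is itself a member of the full skyline.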

Now we have to investigate the quality of our sampling. An obvious quality
measure is the manageability of the sample; its number of objects should be far
smaller than the actual skyline. The consistency of performance is also an
interesting measure, because a larger number of consistent objects means some
amount of correlation and therefore hints at rather small skylines. Our actual
measurement here takes the perpendicular distance between each skyline object and
the diagonal in score space [0,1]^m, normalized to a value between 0 and 100% and
aggregated within 10% intervals. The third measure is a cluster analysis dividing
each score dimension into an upper and a lower half, thus getting 2^m buckets. Our
cluster analysis counts the elements in each bucket and groups the buckets according
to the number of ‘upper halves’ (i.e. score values > 0.5) they contain. Again,
having more elements in the clusters with either high or low numbers of ‘upper
halves’ indicates correlation, whereas objects in the buckets with medium numbers
hint at anti-correlation.
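
The two distribution-based measures can be sketched as follows. The text does not state how the distance is normalized, so dividing by the largest distance any point of the unit cube can have from the diagonal is an assumption of this sketch, as are all function names; at least two score lists are required.

```python
import math
from collections import Counter


def consistency(scores):
    """Perpendicular distance to the diagonal of the unit cube, scaled to [0, 1]."""
    m = len(scores)
    mean = sum(scores) / m  # foot of the perpendicular is (mean, ..., mean)
    distance = math.sqrt(sum((s - mean) ** 2 for s in scores))
    # Assumed normalization: the farthest points are cube corners with
    # floor(m/2) coordinates equal to 1 and the rest equal to 0.
    k = m // 2
    max_distance = math.sqrt(k * (m - k) / m)
    return distance / max_distance


def consistency_histogram(objects, bins=10):
    """Number of objects per 10% interval of the normalized distance."""
    hist = Counter(min(int(consistency(o) * bins), bins - 1) for o in objects)
    return [hist.get(b, 0) for b in range(bins)]


def upper_half_clusters(objects):
    """Number of objects grouped by how many of their scores exceed 0.5."""
    m = len(objects[0])
    counts = Counter(sum(1 for s in o if s > 0.5) for o in objects)
    return [counts.get(k, 0) for k in range(m + 1)]


if __name__ == "__main__":
    sample = [(0.9, 0.9, 0.8), (0.1, 0.9, 0.2), (0.4, 0.5, 0.45)]
    print(consistency_histogram(sample))
    print(upper_half_clusters(sample))
```

Objects close to the diagonal fall into the low histogram intervals, and a sample concentrated in the first or last entries of `upper_half_clusters` corresponds to the correlated case described above.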

Our experiments on how adequately the proposed sampling technique predicts the
actual skyline focus on a 10-dimensional skyline for a database containing
N = 100,000 objects. Score values over all dimensions are uniformly distributed,
and statistical averages over multiple runs and distributions have been taken. We
fixed q := 5 and compare our measurements against the quality of a random sample of
the actual 10-dimensional skyline, i.e. the best sample possible (which, however, in
contrast to our sample cannot be taken without inefficiently calculating the
high-dimensional skyline). Since our sample is expected to substantially reduce the
number of objects, we use a logarithmic axis for the numbers of objects in all
diagrams.

Fig. 8. Cluster analysis for the 10-dim sample

We randomly took 5 groups of 3 score lists and processed their respective
skylines as shown in our algorithm. To measure manageability we compare the average
size of the 10-dim skyline with that of our final sample: the actual size of the
skyline is on average 25133.3 objects, whereas our sample consists of only 313.4
objects, i.e. 1.25% of the original size. Figure 7 shows the consistency of
performance measure for the actual skyline, our sample, and a random sample of about
the same size as our sample. The shapes of the graphs are quite similar, but whereas
the peaks of the actual set (dark line) and its random sample (light line) are
aligned, the peak for our sampling (dashed line) is slightly shifted to the left. We
thus underestimate the consistency of performance a little, because when focusing on
only a subset of dimensions, some quite consistent objects may ‘hide’ behind objects
that are optimal with respect to these dimensions, having only slightly smaller
scores but nevertheless a better consistency. But this effect can only lead to a
slight overestimation of the skyline’s size and thus is in tune with our intention
of preventing the retrieval of huge skylines. Figure 8 addresses our cluster
analysis. Again our sampling graph aligns closely with the correct random sampling
and the actual skyline graph. Only for the buckets of count 3 is there a slight
deviation, which is due to the fact that we have sampled using three dimensions and
thus have definitely seen all optimal objects with scores > 0.5 in these three
dimensions. Thus we slightly overestimate their total count. Overall, our sampling
strategy with reduced dimensions promises, without having to calculate the entire
skyline, to give us an impression of the number of elements of the skyline almost as
accurate as a random sample of the actual skyline would provide. Using this
information either to execute queries safely or to pass them back to the user for
reconsideration in the case of too many estimated skyline objects promises a
better understanding and manageability of skyline queries.



7 Summary and Outlook 

We addressed the important problem of skyline queries in Web information systems.
Skylining extends the expressiveness of the conventional ‘exact match’ or ‘top k’
retrieval models by the notion of Pareto optimality. Thus it is crucial for intuitive
querying in the growing number of Internet-based applications. Distributed Web
information services like [5] or [2] are prime examples that benefit from our
contributions. In contrast to traditional skylining, we presented a first algorithm
that retrieves the skyline over distributed data sources with basic middleware
access techniques, and we have proven that it features an optimal complexity in
terms of object accesses. We also presented a number of advanced heuristics that
further improve performance towards real-time applications. Especially in the area
of mobile information services [22], which use information from various content
providers that is assembled on the fly for subsequent use, our algorithm will allow
for more expressive queries by enabling users to specify even complex preferences in
an intuitive way. Confirming our optimality results, our performance evaluation
shows that our algorithm scales with growing database sizes and already performs
well for reasonable numbers of lists to combine. To overcome the deterioration for
higher numbers of lists (curse of dimensionality) we also proposed an efficient
sampling technique that enables us to estimate the size of a skyline by assessing
the degree of data correlation. This sampling can be performed efficiently without
computing high-dimensional skylines, and its quality is comparable to a correct
random sample of the actual skyline.

Our future work will focus on the generalization of skylining and numerical top k
retrieval towards the problem of multi-objective optimization in information systems,
e.g. over multiple scoring functions as in [10]. In addition, we will focus more
closely on quality aspects of skyline queries. In this context especially a-posteriori
quality assessments along the lines of our sampling technique and qualitative
assessments as in [17] may help users to cope with large result sets. We will also
investigate our proposed quality measures in more detail and evaluate their
individual usefulness.



Abstract. Given the current trend towards application interoperability
and XML-based data integration, there is an increasing need for XML
interfaces to relational database management systems. In this paper we
consider the problem of rewriting a DB-to-XML mapping into several
modified mappings in order to support clients that require various portions
of the mapping-defined data. Mapping rewriting has the effect of
reducing the amount of shipped data and, potentially, query processing 
time at the client. We ship sufficient data to correctly answer the client 
queries. Various techniques to further limit the amount of shipped data 
are examined. We have conducted experiments to validate the usefulness 
of our shipped data reduction techniques in the context of the TPC-W 
benchmark. The experiments confirm that in reasonable applications, 
data reduction is indeed significant (60-90%). 



1 Introduction 

Due to the increasing popularity of XML, enterprise applications need to effi- 
ciently generate and process XML data. Hence, native support for XML data 
is being built into commercial database engines. Still, a large part of today’s 
data currently resides in relational databases and will probably continue to do 
so in the foreseeable future. This is mainly due to the extensive installed base of 
relational databases and the availability of the skills associated with them. How- 
ever, mapping between relational data and XML is not simple. The difficulty in 
performing the transformation arises from the differences in the data models of 
relational databases (relational model) and XML objects (a hierarchical model 
of nested elements).

Currently, this mapping is usually performed as part of the application code.
A significant portion of the effort for enabling an e-business application lies
in developing the code to perform the transformation of data from relational
databases into an XML format or to store the information in an XML object in
a relational database, increasing the cost of e-business development. Moreover,
specifying the mapping in the application code makes the maintenance of the
application difficult, since any change in the database schema, the XML schema
or the business logic requires a new programming effort.

Fig. 1. Typical scenario

A better approach is to externalize the specification of the mapping, and re- 
place the programming effort by the simpler effort of writing a declarative map- 
ping that describes the relationship between the XML constructs and the corre- 
sponding RDBMS constructs. Several notations for defining mappings have been 
proposed: some are based on DTD or XMLSchema annotations [LCPC01,Mic], 
others are based on extensions to SQL or XQuery [XQ]. Indeed, all major com- 
mercial DBMS products (e.g., Oracle 8i, IBM DB2 and Microsoft SQLServer) 
support some form of XML data extraction from relational data. All of these
notations specify mappings from the rows and columns of tables in the relational
model onto the elements and attributes of the XML object. Our objective
is not to define yet another mapping notation. Instead, we introduce an abstract,
notation-neutral, internal representation for mappings, named tagged tree, which
models the constructs of the existing notations.

Consider the following typical enterprise usage scenario. An e-commerce company
A owns a large relational database containing order-related information.
A has signed contracts with a number of companies C_1, C_2, ..., C_n (which may be
divisions of A) that would like to use A’s data (for example, they want to
mine it to discover sales trends). The contracts specify that the data is to be
delivered as XML. So, A needs to expose the content of its database as XML.
The database administrator will therefore define a generic mapping, called the 
DB-to-XML mapping that converts the key-foreign key relationships between 
the various tables to parent-child relationships among XML elements. This is 
illustrated in Figure 1. 

Let us now describe typical use cases for clients: 

— Pre-defined queries (that perhaps can be parameterized) against periodically 
generated official versions of the data (e.g., price tables, addresses, shipping 
rates) . 

— Ad-hoc queries that are applied against the XML data. 

In most enterprises, the first kind by far dominates the second kind. We therefore 
focus on the periodical generation of official data versions. We note that ad-hoc 
queries can also be handled by dynamically applying the techniques we explore 
in this paper, on a per-query basis.

An obvious option is to execute the DB-to-XML mapping and ship each
client the result. Since the mappings are defined in a generic fashion in order to
accommodate all possible users, they may contain a large amount of data that is
irrelevant for any particular user. Such a strategy would thus result in huge XML
documents that are expensive not only to transmit over the network but
also to query by the interested parties. We therefore need to
analyze alternative deployment strategies.

Consider a client C with a set of (possibly parameterizable) queries QS.¹
Let X be the DB-to-XML mapping defined by A over its database D. Instead
of shipping to C the whole XML data, namely X(D), A would like to ship
only data that is relevant for QS (and that should produce, for queries in QS,
the same answers as those on X(D)). We show that determining the minimum
amount of data that can be shipped is NP-hard and most probably cannot be
done efficiently. Nevertheless, we devise efficient methods that for many common
applications generate significantly smaller amounts of shipped data as compared
with X(D).



1.1 An Example 

Consider the DB-to-XML mapping X defined by the tagged tree in Figure 2. 
Intuitively, the XML data tree specified by this mapping is generated by a depth- 
first traversal of the tagged tree, where each SQL query is executed and the re- 
sulting tuples are used to populate the text nodes (we defer the formal definition 
of data tree generation to Section 2). 

Now, consider the following query set: QS = {
/polist/po[status = 'processing']/orderline/item,
/polist/po[status = 'processing']/customer,
/polist/po[status = 'pending']/billTo,
/polist/po[status = 'pending']/customer }.




z := select * from Address
where a_id = x.o_bill_to_id



Fig. 2. Original tagged tree 



1 QS may be a generalization of the actual queries which may be too sensitive to 
disclose.

z := select * from Address
where a_id = x.o_bill_to_id
and x.status = ‘pending’



Fig. 3. Modified tagged tree 

Clearly, in order to support the query set QS, we do not need to compute 
the full data tree defined by the mapping X, because only some parts of the 
resulting data tree will be used by these queries, while others are completely 
ignored by these queries (and therefore useless). By examining the query set, we 
realize that the only purchase orders (po) that need to be generated are those 
whose status is either “processing” or “pending”. Moreover, for the “pending” 
purchase orders, the queries are only looking for the “customer” and “billTo” 
information, but do not need any “orderline” information. On the other hand, 
for the purchase orders whose status is “processing”, the queries need the “item”
branch of “orderline” and the “customer” information. 

The aim of our mapping rewriting is to analyze a query set QS and a map- 
ping definition X and produce a custom mapping definition X' that provides 
sufficient data for all the queries in QS and, when materialized, does not generate 
“useless data”. For the above mapping and query set, for example, the algorithm 
will produce the mapping X' defined by the rewritten tagged tree depicted in 
Figure 3. 

There are several features of this modified tagged tree that we would like to 
point out: 

1. the query associated with the “po” node has been augmented with a dis- 
junction of predicates on the order status that effectively cause only relevant 
purchase orders to be generated; 

2. the query associated with the “billTo” node has been extended with a pred- 
icate that restricts the generation of this type of subtrees to pending orders; 

3. the query associated with the “orderline” node has been extended with a 
predicate that restricts the generation of these subtrees to processing pur- 
chase orders; 

4. the “qty” node has been eliminated completely, as it is not referenced in the 
query set. 
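
The inspection of the query set that leads to features 1 to 3 can be imitated with a few lines of Python. This is only a toy pattern match restricted to queries of the shape used in this example (`/polist/po[status = '...']/branch/...`); it is not the rewriting algorithm developed in the following sections, and all names in it are hypothetical.

```python
import re
from collections import defaultdict

QUERY_SET = [
    "/polist/po[status = 'processing']/orderline/item",
    "/polist/po[status = 'processing']/customer",
    "/polist/po[status = 'pending']/billTo",
    "/polist/po[status = 'pending']/customer",
]

# Only this one query shape is understood by the sketch.
PATTERN = re.compile(r"^/polist/po\[status = '([^']+)'\]/(.+)$")


def needed_branches(queries):
    """Map each order status to the child branches of po that the queries use."""
    needed = defaultdict(set)
    for query in queries:
        match = PATTERN.match(query)
        if match is None:
            raise ValueError(f"unsupported query shape: {query}")
        status, path = match.groups()
        needed[status].add(path.split("/")[0])
    return dict(needed)


def branch_predicates(needed):
    """Map each branch to the sorted order statuses for which it is required."""
    by_branch = defaultdict(set)
    for status, branches in needed.items():
        for branch in branches:
            by_branch[branch].add(status)
    return {branch: sorted(statuses) for branch, statuses in by_branch.items()}


if __name__ == "__main__":
    needed = needed_branches(QUERY_SET)
    print(sorted(needed))             # statuses for the disjunction on the po node
    print(branch_predicates(needed))  # restricting statuses per branch
```

The statuses found correspond to the disjunction added to the “po” query, and the per-branch status lists correspond to the predicates added to the “billTo” and “orderline” queries; “customer” is needed for both statuses and therefore gets no additional restriction.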

This rewritten DB-to-XML mapping definition, X', when evaluated against
a TPC-W benchmark database instance, reduces the size of the generated data
tree by more than 60% compared to the generated data tree for the original
mapping X.



1.2 Main Contributions and Paper Structure

Our main contribution is in devising a practical method to rewrite DB-to-XML
mappings so as to reflect a client’s (Xpath) query workload and generate data
likely to be relevant to that workload. We never ship more data than the naive
“ship X(D)” approach. Realistic experimentation shows very significant savings in
the amount of generated data. Various optimizations, both at the mapping level
and at the generated data level, are outlined. We also prove that generating the
minimum amount of data is intractable (NP-hard).

In Section 2 we define our DB-to-XML definition language, trees whose nodes 
are tagged with element names and, optionally, with a tuple variable binding 
formula or a column extraction formula. In Section 3 we show how an Xpath 
expression can be converted into an equivalent set of normalized queries that only 
navigate on child:: and self:: axes. Given a normalized query and a matching to 
the tagged tree, we show in Section 4 how to modify the original tagged tree 
so as to retrieve data that is relevant for one matching. Usually, a single query 
may result in many matchings with the tagged tree. We explain how rewritten 
tagged trees resulting from such mappings may be superimposed. A crucial issue 
is how to ensure that the superimposition does not result in loss of selectivity. We 
explore two basic kinds of optimizations: in Section 5 we focus on modifications
to the resulting tagged trees so as to further limit generated XML data. Section 6 
presents our experimental results. We consider a realistic query workload and 
show that our method results in significant savings. Conclusions and future work 
are in Section 7. 



1.3 Related Work 

We have recently witnessed an increasing interest in the research community
in the efficient processing of XML data. There are essentially two main directions:
designing native semi-structured and XML databases (e.g., Natix [KM00],
Lore [GMW00], Tamino [Sch01]) and using off-the-shelf relational database systems.

While we do recognize the potential performance benefits of native storage
systems, in this work we focus on systems that publish existing relational data as
XML. In this category, the XPERANTO system [SSB+00, CKS+00] provides an
XMLQuery-based mechanism for defining virtual XML views over a relational
schema and translates XML queries against the views to SQL. A similar system,
SilkRoute [FMS01], introduces another view definition language, called RXL, and
transforms XML-QL queries on views to a sequence of SQL queries. SilkRoute is
designed to handle one query at a time. Our work extends SilkRoute’s approach
by taking a more general view: rather than solving each individual query,
we consider a group of queries and rewrite the mapping to efficiently support
that group.

Fig. 4. A simple tagged tree



2 A Simple Relational to XML Mapping Mechanism 

In this section we introduce a simple DB-to-XML mapping mechanism called 
tagged trees. 

Definition 1. Consider a relational database D. A tagged tree over D is a tree
whose root node is labeled “Root” and each non-root node v is labeled by an XML
element or attribute name and may also have an annotation, v.annotation, of
one of the following two types:

— x := Q, where x is a tuple variable and Q is a query on D; intuitively, x is a 
tuple ranging over the result of query Q; we say that this node binds variable 
x; a node cannot bind the same variable as bound by one of its ancestors but 
Q may make references to variables bound by its ancestor nodes; we refer to 
Q as v.formula and to x as v.var;

— x.C, where C is a database column name and x is a variable bound in an 
annotation of an ancestor node; we call this type of annotation “value
annotation”, because it assigns values to nodes; in this case v.formula is defined
as x.C ≠ NULL.

A tagged tree T defines a DB-to-XML mapping X_T over a relational
database; the result of applying the mapping X_T to a database instance D
is an XML tree XT = X_T(D), called a data tree image of T, inductively defined
as follows:

1. the root of XT is the “document root” node; we say that the root of XT is
the image of the root of T. In general, each tagged tree node expands into a
set of image nodes in XT;

2. if T is obtained from another tree T' by adding a child v to a node w of T'
then:

a) if v has no annotation, then XT is obtained from XT' = X_T'(D) by
attaching a child node, with the same label as v, to each image node of
w in XT'; the set of these nodes is the image of v in XT;

b) if v has a binding annotation x := Q and Q does not contain ancestor
variable references, then XT is obtained from XT' as follows: for each
node in the image of w, we attach a set of |Q(D)| nodes with the same
label as v; each of these nodes corresponds to a binding of the variable x
to a specific tuple in Q(D) (called the current binding of that variable);

c) if v has a binding annotation x := Q and Q contains ancestor variable
references, then for each node in the image of w there are current binding
tuples for the variables that are referenced in Q; we replace each variable
reference y.C by the value of the C column in the current binding of y
and proceed as in the case with no variable references;

d) if v has a value annotation x.C, then we attach a child node, with the
same label as v, to every image of w in XT' and set the text value of
each such new node to the value of the C column in the current binding
tuple for x.
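
The inductive definition translates directly into a recursive expansion. The following Python sketch assumes an in-memory “database” of row dictionaries and represents each query Q as a Python function of the database and the current bindings instead of SQL; the class `TaggedNode`, the functions `expand` and `data_tree`, and the toy rows are all hypothetical. It follows case d) literally and does not treat NULL values specially.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class TaggedNode:
    label: str
    # Binding annotation x := Q as (variable, query); query(db, env) -> rows.
    bind: Optional[tuple[str, Callable]] = None
    # Value annotation x.C as (variable, column).
    value: Optional[tuple[str, str]] = None
    children: list = field(default_factory=list)


def expand(node, parent_elem, db, env):
    """Attach the images of `node` below one image of its parent."""
    if node.bind is not None:
        var, query = node.bind
        # Cases b) and c): one image node per tuple returned by the query.
        bindings = [{**env, var: row} for row in query(db, env)]
    else:
        bindings = [env]  # cases a) and d): exactly one image node
    for current in bindings:
        elem = ET.SubElement(parent_elem, node.label)
        if node.value is not None:
            var, column = node.value
            elem.text = str(current[var][column])
        for child in node.children:
            expand(child, elem, db, current)


def data_tree(top_level_nodes, db):
    """Build the data tree image below a document root element."""
    root = ET.Element("Root")
    for node in top_level_nodes:
        expand(node, root, db, {})
    return root


# Toy instance; table and column names are illustrative only.
db = {
    "Orders": [
        {"o_id": 1, "status": "pending", "o_bill_to_id": 10},
        {"o_id": 2, "status": "processing", "o_bill_to_id": 11},
    ],
    "Address": [{"a_id": 10, "city": "A-town"}, {"a_id": 11, "city": "B-town"}],
}

tree = TaggedNode("polist", children=[
    TaggedNode("po", bind=("x", lambda db, env: db["Orders"]), children=[
        TaggedNode("status", value=("x", "status")),
        TaggedNode(
            "billTo",
            bind=("z", lambda db, env: [
                a for a in db["Address"] if a["a_id"] == env["x"]["o_bill_to_id"]
            ]),
            children=[TaggedNode("city", value=("z", "city"))],
        ),
    ]),
])

root = data_tree([tree], db)
print(ET.tostring(root, encoding="unicode"))
```

Each call of a query function stands for one request to the DBMS, so the sketch corresponds to a plain depth-first evaluation with one query execution per image node.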

The data tree image can be generated by a simple depth-first traversal of 
the tagged tree and separate execution of each query. However, this evaluation 
strategy is very inefficient, especially if the queries return many results, since it 
involves a large number of requests to the DBMS. A better strategy is to combine 
all the queries into a single “sorted outer union” query and use the result to 
produce the XML document. Since the resulting tuples come in the same order 
as the final XML document no in-memory buffering is necessary and the result 
can be pipelined to a tagger component (as in XPERANTO [SKS+01]).



3 Xpath Expression Normalization 

Xpath [XP] is a simple yet powerful language for querying XML data trees.
In particular it includes “traversal steps” that traverse a number of edges at
a time (for example descendant-or-self) and ones that utilize non-parent-child
edges (such as preceding-sibling and following-sibling). Even more complex are
the following and preceding axes. Our goal in normalizing an Xpath expression
is to bring it to a form in which the only allowed axes are child:: and self::. To
illustrate our ideas we first define the simple Xpath expressions (SXE) fragment
of Xpath and later on extend it. The grammar for the main fragment we treat,
SXE, is defined as follows:



EXP       := Root | Root RLP
RLP       := Step | Step ‘/’ RLP
Step      := AxisName NodeTest Predicate*
Predicate := ‘[’ PredExp ‘]’
PredExp   := RLP | RLP ‘/’ ‘self::node()’ Op Value
             | ‘self::node()’ Op Value




